Abstract 

Online prediction methods are typically presented as serial algorithms running on a single 
processor. However, in the age of web-scale prediction problems, it is increasingly common 
to encounter situations where a single processor cannot keep up with the high rate at which 
inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of 
converting many serial gradient-based online prediction algorithms into distributed algo- 
rithms. We prove a regret bound for this method that is asymptotically optimal for smooth 
convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into 
account communication latencies between nodes in the distributed environment. We show 
how our method can be used to solve the closely-related distributed stochastic optimization 
problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we 
demonstrate the merits of our approach on a web-scale online prediction problem. 

Keywords: distributed computing, online learning, stochastic optimization, regret bounds, 
convex optimization 



1. Introduction 

Many natural prediction problems can be cast as stochastic online prediction problems. 
These are often discussed in the serial setting, where the computation takes place on a 
single processor. However, when the inputs arrive at a high rate and have to be processed 
in real time, there may be no choice but to distribute the computation across multiple 
cores or multiple cluster nodes. For example, modern search engines process thousands 
of queries a second, and indeed they are implemented as distributed algorithms that run 
in massive data-centers. In this paper, we focus on such large-scale and high-rate online 
prediction problems, where parallel and distributed computing is critical to providing a 
real-time service. 

First, we begin by defining the stochastic online prediction problem. Suppose that we 
observe a stream of inputs $z_1, z_2, \ldots$, where each $z_i$ is sampled independently from a fixed 
unknown distribution over a sample space $Z$. Before observing each $z_i$, we predict a point 
$w_i$ from a set $W$. After making the prediction $w_i$, we observe $z_i$ and suffer the loss $f(w_i, z_i)$, 
where $f$ is a predefined loss function. Then we use $z_i$ to improve our prediction mechanism 
for the future (e.g., using a stochastic gradient method). The goal is to accumulate the 
smallest possible loss as we process the sequence of inputs. More specifically, we measure 
the quality of our predictions using the notion of regret, defined as 

$$R(m) = \sum_{i=1}^{m} \bigl( f(w_i, z_i) - f(w^*, z_i) \bigr),$$

where $w^* = \arg\min_{w \in W} \mathbb{E}_z[f(w, z)]$. Regret measures the difference between the cumulative 
loss of our predictions and the cumulative loss of the fixed predictor $w^*$, which is optimal 
with respect to the underlying distribution. Since regret relies on the stochastic inputs $z_i$, 
it is a random variable. For simplicity, we focus on bounding the expected regret $\mathbb{E}[R(m)]$, 
and later use these results to obtain high-probability bounds on the actual regret. In this 
paper, we restrict our discussion to convex prediction problems, where the loss function 
$f(w, z)$ is convex in $w$ for every $z \in Z$, and $W$ is a closed convex subset of $\mathbb{R}^n$. 

Before continuing, we note that the stochastic online prediction problem is closely 
related, but not identical, to the stochastic optimization problem (see, e.g., Wets, 1989; 
Birge and Louveaux, 1997; Nemirovski et al., 2009). The main difference between the two 
is in their goals: in stochastic optimization, the goal is to generate a sequence $w_1, w_2, \ldots$ 
that quickly converges to the minimizer of the function $F(\cdot) = \mathbb{E}_z[f(\cdot, z)]$. The motivating 
application is usually a static (batch) problem, and not an online process that occurs over 
time. Large-scale static optimization problems can always be solved using a serial approach, 
at the cost of a longer running time. In online prediction, the goal is to generate a sequence 
of predictions that accumulates a small loss along the way, as measured by regret. The 
relevant motivating application here is providing a real-time service to users, so our algo- 
rithm must keep up with the inputs as they arrive, and we cannot choose to slow down. 
In this sense, distributed computing is critical for large-scale online prediction problems. 
Despite these important differences, our techniques and results can be readily adapted to 
the stochastic online optimization setting. 

We model our distributed computing system as a set of k nodes, each of which is an 
independent processor, and a network that enables the nodes to communicate with each 
other. Each node receives an incoming stream of examples from an outside source, such as 
a load balancer/splitter. As in the real world, we assume that the network has a limited 
bandwidth, so the nodes cannot simply share all of their information, and that messages 
sent over the network incur a non-negligible latency. However, we assume that network 
operations are non-blocking, meaning that each node can continue processing incoming 
traffic while network operations complete in the background. 

How well can we perform in such a distributed environment? At one extreme, an ideal 
(but unrealistic) solution to our problem is to run a serial algorithm on a single "super" 
processor that is k times faster than a standard node. This solution is optimal, simply 
because any distributed algorithm can be simulated on a fast-enough single processor. It 
is well-known that the optimal regret bound that can be achieved by a gradient-based 
serial algorithm on an arbitrary convex loss is $O(\sqrt{m})$ (e.g., Nemirovski and Yudin, 1983; 
Cesa-Bianchi and Lugosi, 2006; Abernethy et al., 2009). At the other extreme, a trivial 
solution to our problem is to have each node operate in isolation of the other $k - 1$ nodes, 
running an independent copy of a serial algorithm, without any communication over the 
network. We call this the no-communication solution. The main disadvantage of this 
solution is that the performance guarantee, as measured by regret, scales poorly with the 
network size $k$. More specifically, assuming that each node processes $m/k$ inputs, the 
expected regret per node is $O(\sqrt{m/k})$. Therefore, the total regret across all $k$ nodes is 
$O(\sqrt{km})$ - namely, a factor of $\sqrt{k}$ worse than the ideal solution. The first sanity-check 
that any distributed online prediction algorithm must pass is that it outperforms the naive 
no-communication solution. 

In this paper, we present the distributed mini-batch (DMB) algorithm, a method of 
converting any serial gradient-based online prediction algorithm into a parallel or distributed 
algorithm. This method has two important properties: 

• It can use any gradient-based update rule for serial online prediction as a black box, 
and convert it into a parallel or distributed online prediction algorithm. 

• If the loss function $f(w, z)$ is smooth in $w$ (see the precise definition in Equation (5)), 
then our method attains an asymptotically optimal regret bound of $O(\sqrt{m})$. Moreover, 
the coefficient of the dominant term $\sqrt{m}$ is the same as in the serial bound, and 
independent of $k$ and of the network topology. 

The idea of using mini-batches in stochastic and online learning is not new, and has been 
previously explored in both the serial and parallel settings (see, e.g., Shalev-Shwartz et al., 
2007; Gimpel et al., 2010). However, to the best of our knowledge, our work is the first to 
use this idea to obtain such strong results in a parallel and distributed learning setting (see 
Section 7 for a comparison to related work). 

Our results build on the fact that the optimal regret bound for serial stochastic gradient- 
based prediction algorithms can be refined if the loss function is smooth. In particular, it can 
be shown that the hidden coefficient in the $O(\sqrt{m})$ notation is proportional to the standard 
deviation of the stochastic gradients evaluated at each predictor $w_i$ (Juditsky et al., 2011; 
Lan, 2009; Xiao, 2010). We make the key observation that this coefficient can be effectively 
reduced by averaging a mini-batch of stochastic gradients computed at the same predictor, 
and this can be done in parallel with simple network communication. However, the 
non-negligible communication latencies prevent a straightforward parallel implementation from 
obtaining the optimal serial regret bound.$^1$ In order to close the gap, we show that by 
letting the mini-batch size grow slowly with $m$, we can attain the optimal $O(\sqrt{m})$ regret 
bound, where the dominant term of order $\sqrt{m}$ is independent of the number of nodes $k$ and 
of the latencies introduced by the network. 

The paper is organized as follows. In Section 2, we present a template for stochastic 
gradient-based serial prediction algorithms, and state refined variance-based regret bounds 
for smooth loss functions. In Section 3, we analyze the effect of using mini-batches in the 



1. For example, if the network communication operates over a minimum-depth spanning tree and the 
diameter of the network scales as $\log(k)$, then we can show that a straightforward implementation of the 
idea of parallel variance reduction leads to an $O(\sqrt{m \log(k)})$ regret bound. See Section 4 for details. 

Algorithm 1: Template for a serial first-order stochastic online prediction algorithm. 

for j = 1, 2, ... do 
    predict $w_j$; 
    receive input $z_j$ sampled i.i.d. from unknown distribution; 
    suffer loss $f(w_j, z_j)$; 
    define $g_j = \nabla_w f(w_j, z_j)$; 
    compute $(w_{j+1}, a_{j+1}) = \phi(a_j, g_j, \alpha_j)$; 
end 



serial setting, and show that it does not significantly affect the regret bounds. In Section 4, 
we present the DMB algorithm, and show that it achieves an asymptotically optimal serial 
regret bound for smooth loss functions. In Section 5, we show that the DMB algorithm 
attains the optimal rate of convergence for stochastic optimization, with an asymptotically 
linear speed-up. In Section 6, we complement our theoretical results with an experimental 
study on a realistic web-scale online prediction problem. While substantiating the 
effectiveness of our approach, our empirical results also demonstrate some interesting properties 
of mini-batching that are not reflected in our theory. We conclude with a comparison of 
our methods to previous work in Section 7, and a discussion of potential extensions and 
future research in Section 8. The main topics presented in this paper are summarized in 
Dekel et al. (2011). Dekel et al. (2011) also present robust variants of our approach, which 
are resilient to failures and node heterogeneity in an asynchronous distributed environment. 

2. Variance Bounds for Serial Algorithms 

Before discussing distributed algorithms, we must fully understand the serial algorithms 
on which they are based. We focus on gradient-based optimization algorithms that follow 
the template outlined in Algorithm 1. In this template, each prediction is made by an 
unspecified update rule: 

$$(w_{j+1}, a_{j+1}) = \phi(a_j, g_j, \alpha_j). \tag{1}$$

The update rule $\phi$ takes three arguments: an auxiliary state vector $a_j$ that summarizes 
all of the necessary information about the past, a gradient $g_j$ of the loss function $f(\cdot, z_j)$ 
evaluated at $w_j$, and an iteration-dependent parameter $\alpha_j$ such as a stepsize. The update 
rule outputs the next predictor $w_{j+1} \in W$ and a new auxiliary state vector $a_{j+1}$. Plugging 
in different update rules results in different online prediction algorithms. For simplicity, we 
assume for now that the update rules are deterministic functions of their inputs. 
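
As an illustration only, the serial template of Algorithm 1 can be sketched in Python with the update rule passed in as a black box. The names `run_serial` and `sgd_phi`, the squared loss, and the stepsize schedule are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def run_serial(phi, grad, loss, inputs, w1, a1, alpha):
    """Algorithm 1 template: predict, observe, suffer loss, update via phi."""
    w, a, total_loss = w1, a1, 0.0
    for j, z in enumerate(inputs, start=1):
        total_loss += loss(w, z)      # suffer loss f(w_j, z_j)
        g = grad(w, z)                # g_j = gradient of f(., z_j) at w_j
        w, a = phi(a, g, alpha(j))    # (w_{j+1}, a_{j+1}) = phi(a_j, g_j, alpha_j)
    return w, total_loss

# Hypothetical update rule: unconstrained gradient step, with state a_j = w_j.
def sgd_phi(a, g, alpha_j):
    w_next = a - g / alpha_j
    return w_next, w_next

# Hypothetical problem: f(w, z) = 0.5 * ||w - z||^2, inputs drawn around the point (1, 1, 1).
rng = np.random.default_rng(0)
zs = rng.normal(loc=1.0, scale=0.5, size=(2000, 3))
w_final, cum_loss = run_serial(
    sgd_phi,
    grad=lambda w, z: w - z,
    loss=lambda w, z: 0.5 * float(np.sum((w - z) ** 2)),
    inputs=zs, w1=np.zeros(3), a1=np.zeros(3),
    alpha=lambda j: 1.0 + np.sqrt(j),
)
```

Any other rule with the same `(a, g, alpha_j) -> (w_next, a_next)` signature can be substituted for `sgd_phi` without changing the loop.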

As concrete examples, we present two well-known update rules that fit the above 
template. The first is the projected gradient descent update rule, 

$$w_{j+1} = \pi_W\Bigl( w_j - \frac{1}{\alpha_j} g_j \Bigr), \tag{2}$$

where $\pi_W$ denotes the Euclidean projection onto the set $W$. Here $1/\alpha_j$ is a decaying learning 
rate, with $\alpha_j$ typically set to be $\Theta(\sqrt{j})$. This fits the template in Algorithm 1 by defining $a_j$ 
to simply be $w_j$, and defining $\phi$ to correspond to the update rule specified in Equation (2). 
We note that the projected gradient method is a special case of the more general class of 
mirror descent algorithms (e.g., Nemirovski et al., 2009; Lan, 2009), which all fit in the 
template of Equation (1). 
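
A minimal sketch of the projected gradient update in Equation (2) follows, assuming for illustration that $W$ is a Euclidean ball of a given radius (the paper allows any closed convex set). The helper names `project_ball` and `pgd_phi` are hypothetical.

```python
import numpy as np

def project_ball(w, radius):
    """Euclidean projection onto W = {w : ||w|| <= radius} (an assumed choice of W)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def pgd_phi(a, g, alpha_j, radius=1.0):
    """Equation (2): w_{j+1} = pi_W(w_j - g_j / alpha_j), with state a_j = w_j."""
    w_next = project_ball(a - g / alpha_j, radius)
    return w_next, w_next

# One step from w_j = (0.6, 0.8): the raw step lands at (2.1, 2.8), outside the unit ball,
# and the projection rescales it back onto the boundary.
w_next, a_next = pgd_phi(np.array([0.6, 0.8]), np.array([-3.0, -4.0]), alpha_j=2.0)
```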

Another family of update rules that fit in our setting is the dual averaging method 
(Nesterov, 2009; Xiao, 2010). A dual averaging update rule takes the form 

$$w_{j+1} = \arg\min_{w \in W} \Bigl\{ \Bigl\langle \sum_{i=1}^{j} g_i, \, w \Bigr\rangle + \alpha_j h(w) \Bigr\}, \tag{3}$$

where $\langle \cdot, \cdot \rangle$ denotes the vector inner product, $h : W \to \mathbb{R}$ is a strongly convex auxiliary 
function, and $\alpha_j$ is a monotonically increasing sequence of positive numbers, usually set to 
be $\Theta(\sqrt{j})$. The dual averaging update rule fits the template in Algorithm 1 by defining 
$a_j$ to be $\sum_{i=1}^{j-1} g_i$. In the special case where $h(w) = (1/2)\|w\|_2^2$, the minimization problem 
in Equation (3) has the closed-form solution 

$$w_{j+1} = \pi_W\Bigl( -\frac{1}{\alpha_j} \sum_{i=1}^{j} g_i \Bigr). \tag{4}$$
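
The closed-form dual averaging update in Equation (4) can be sketched in the same way. As before, this is an illustration under the assumption that $W$ is a Euclidean ball; `project_ball` and `dual_avg_phi` are hypothetical names.

```python
import numpy as np

def project_ball(w, radius):
    """Euclidean projection onto W = {w : ||w|| <= radius} (an assumed choice of W)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def dual_avg_phi(a, g, alpha_j, radius=1.0):
    """Equation (4) with h(w) = 0.5 * ||w||^2; the state a_j holds the sum of past gradients."""
    a_next = a + g                                   # running gradient sum including g_j
    w_next = project_ball(-a_next / alpha_j, radius)
    return w_next, a_next

a0 = np.zeros(2)
w2, a2 = dual_avg_phi(a0, np.array([1.0, -1.0]), alpha_j=4.0)   # interior point, no clipping
w3, a3 = dual_avg_phi(a2, np.array([3.0, -3.0]), alpha_j=2.0)   # clipped to the unit ball
```

Unlike projected gradient descent, the state here is the gradient sum rather than the previous predictor, so the projection never feeds back into later updates.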



For stochastic online prediction problems with convex loss functions, both of these 
update rules have expected regret bound of $O(\sqrt{m})$. In general, the coefficient of the dominant 
$\sqrt{m}$ term is proportional to an upper bound on the expected norm of the stochastic gradient 
(e.g., Zinkevich, 2003). Next we present refined bounds for smooth convex loss functions, 
which enable us to develop optimal distributed algorithms. 



2.1 Optimal Regret Bounds for Smooth Loss Functions 

As stated in the introduction, we assume that the loss function $f(w, z)$ is convex in $w$ 
for each $z \in Z$ and that $W$ is a closed convex set. We use $\|\cdot\|$ to denote the 
Euclidean norm in $\mathbb{R}^n$. For convenience, we use the notation $F(w) = \mathbb{E}_z[f(w, z)]$ and assume 
$w^* = \arg\min_{w \in W} F(w)$ always exists. Our main results require a couple of additional 
assumptions: 

• Smoothness - we assume that $f$ is $L$-smooth in its first argument, which means that 
for any $z \in Z$, the function $f(\cdot, z)$ has $L$-Lipschitz continuous gradients. Formally, 

$$\forall z \in Z, \ \forall w, w' \in W, \quad \|\nabla_w f(w, z) - \nabla_w f(w', z)\| \le L \|w - w'\|. \tag{5}$$

• Bounded Gradient Variance - we assume that $\nabla_w f(w, z)$ has a $\sigma^2$-bounded variance 
for any fixed $w$, when $z$ is sampled from the underlying distribution. In other words, 
we assume that there exists a constant $\sigma \ge 0$ such that 

$$\forall w \in W, \quad \mathbb{E}_z\bigl[ \|\nabla_w f(w, z) - \nabla F(w)\|^2 \bigr] \le \sigma^2.$$

Using these assumptions, regret bounds that explicitly depend on the gradient variance 
can be established (Juditsky et al., 2011; Lan, 2009; Xiao, 2010). In particular, for the 
projected stochastic gradient method defined in Equation (2), we have the following result: 

Theorem 1 Let $f(w, z)$ be an $L$-smooth convex loss function in $w$ for each $z \in Z$ and 
assume that the stochastic gradient $\nabla_w f(w, z)$ has $\sigma^2$-bounded variance for all $w \in W$. In 
addition, assume that $W$ is convex and bounded, and let $D = \sqrt{\max_{u,v \in W} \|u - v\|^2 / 2}$. Then 
using $\alpha_j = L + (\sigma/D)\sqrt{j}$ in Equation (2) gives 

$$\mathbb{E}[R(m)] \le \bigl(F(w_1) - F(w^*)\bigr) + D^2 L + 2D\sigma\sqrt{m}.$$

In the above theorem, the assumption that $W$ is a bounded set does not play a critical 
role. Even if the learning problem has no constraints on $w$, we could always confine the 
search to a bounded set (say, a Euclidean ball of some radius) and Theorem 1 guarantees 
an $O(\sqrt{m})$ regret compared to the optimum within that set. 

Similarly, for the dual averaging method defined in Equation ([3]), we have: 

Theorem 2 Let $f(w, z)$ be an $L$-smooth convex loss function in $w$ for each $z \in Z$, assume 
that the stochastic gradient $\nabla_w f(w, z)$ has $\sigma^2$-bounded variance for all $w \in W$, and let $D = 
\sqrt{h(w^*) - \min_{w \in W} h(w)}$. Then, by setting $w_1 = \arg\min_{w \in W} h(w)$ and $\alpha_j = L + (\sigma/D)\sqrt{j}$ 
in the dual averaging method we have 

$$\mathbb{E}[R(m)] \le \bigl(F(w_1) - F(w^*)\bigr) + D^2 L + 2D\sigma\sqrt{m}.$$



For both of the above theorems, if $\nabla F(w^*) = 0$ (which is certainly the case if $W = \mathbb{R}^n$), 
then the expected regret bounds can be simplified to 

$$\mathbb{E}[R(m)] \le 2D^2 L + 2D\sigma\sqrt{m}. \tag{6}$$

Proofs for these two theorems, as well as the above simplification, are given in Appendix A. 
Although we focus on expected regret bounds here, our results can equally be stated as 
high-probability bounds on the actual regret (see Appendix B for details). 

In both Theorem 1 and Theorem 2, the parameters $\alpha_j$ are functions of $\sigma$. It may be 
difficult to obtain precise estimates of the gradient variance in many concrete applications. 
However, note that any upper bound on the variance suffices for the theoretical results 
to hold, and identifying such a bound is often easier than precisely estimating the actual 
variance. A loose bound on the variance will increase the constants in our regret bounds, 
but will not change their qualitative $O(\sqrt{m})$ rate. 

Euclidean gradient descent and dual averaging are not the only update rules that can be 
plugged into Algorithm 1. The analysis in Appendix A (and Appendix B) actually applies 
to a much larger class of update rules, which includes the family of mirror descent updates 
(Nemirovski et al., 2009; Lan, 2009) and the family of (non-Euclidean) dual averaging 
updates (Nesterov, 2009; Xiao, 2010). For each of these update rules, we get an expected 
regret bound that closely resembles the bound in Equation (6). 

Similar results can also be established for loss functions of the form $f(w, z) + \Psi(w)$, 
where $\Psi(w)$ is a simple convex regularization term that is not necessarily smooth. For 
example, setting $\Psi(w) = \lambda\|w\|_1$ with $\lambda > 0$ promotes sparsity in the predictor $w$. To 
extend the dual averaging method, we can use the following update rule in Xiao (2010): 

$$w_{j+1} = \arg\min_{w \in W} \Bigl\{ \Bigl\langle \frac{1}{j} \sum_{i=1}^{j} g_i, \, w \Bigr\rangle + \Psi(w) + \frac{\alpha_j}{j} h(w) \Bigr\}.$$

Similar extensions to the mirror descent method can be found in, for example, Duchi and Singer. 
Using these composite forms of the algorithms, the same regret bounds as in 
Theorem 1 and Theorem 2 can be achieved even if $\Psi(w)$ is nonsmooth. The analysis is almost 
identical to Appendix A by using the general framework of Tseng (2008). 

Asymptotically, the bounds we presented in this section are only controlled by the 
variance $\sigma^2$ and the number of iterations $m$. Therefore, we can think of any of the bounds 
mentioned above as an abstract function $\psi(\sigma^2, m)$, which we assume to be monotonically 
increasing in its arguments. 



2.2 Analyzing the No-Communication Parallel Solution 

Using the abstract notation $\psi(\sigma^2, m)$ for the expected regret bound simplifies our 
presentation significantly. As an example, we can easily give an analysis of the no-communication 
parallel solution described in the introduction. 

In the naive no-communication solution, each of the $k$ nodes in the parallel system 
applies the same serial update rule to its own substream of the high-rate inputs, and no 
communication takes place between them. If the total number of examples processed by 
the $k$ nodes is $m$, then each node processes at most $\lceil m/k \rceil$ inputs. The examples received 
by each node are i.i.d. from the original distribution, with the same variance bound $\sigma^2$ 
for the stochastic gradients. Therefore, each node suffers an expected regret of at most 
$\psi(\sigma^2, \lceil m/k \rceil)$ on its portion of the input stream, and the total regret bound is obtained by 
simply summing over the $k$ nodes, that is, 

$$\mathbb{E}[R(m)] \le k\, \psi\Bigl(\sigma^2, \Bigl\lceil \frac{m}{k} \Bigr\rceil\Bigr).$$

If $\psi(\sigma^2, m) = 2D^2 L + 2D\sigma\sqrt{m}$, as in Equation (6), then the expected total regret is 

$$\mathbb{E}[R(m)] \le 2kD^2 L + 2D\sigma k \sqrt{\Bigl\lceil \frac{m}{k} \Bigr\rceil}.$$

Comparing this bound to $2D^2 L + 2D\sigma\sqrt{m}$ in the ideal serial solution, we see that it is 
approximately $\sqrt{k}$ times worse in its leading term. This is the price one pays for the lack 
of communication in the distributed system. In Section 4, we show how this $\sqrt{k}$ factor can 
be avoided by our DMB approach. 
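
To make the $\sqrt{k}$ gap concrete, the two bounds can be evaluated numerically. The following sketch uses the bound of Equation (6) as $\psi$; the constants $D = L = \sigma = 1$ and the values of $m$ and $k$ are arbitrary choices made for illustration.

```python
import math

def psi(sigma2, m, D=1.0, L=1.0):
    """Serial bound of Equation (6): 2 D^2 L + 2 D sigma sqrt(m)."""
    return 2 * D**2 * L + 2 * D * math.sqrt(sigma2) * math.sqrt(m)

def no_comm_bound(sigma2, m, k, D=1.0, L=1.0):
    """k isolated nodes, each processing at most ceil(m / k) inputs."""
    return k * psi(sigma2, math.ceil(m / k), D, L)

m, k, sigma2 = 10**8, 16, 1.0
ratio = no_comm_bound(sigma2, m, k) / psi(sigma2, m)   # close to sqrt(k) = 4 for large m
```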





3. Serial Online Prediction using Mini-Batches 

The expected regret bounds presented in the previous section depend on the variance of the 
stochastic gradients. The explicit dependency on the variance naturally suggests the idea 
of using averaged gradients over mini-batches to reduce the variance. Before we present the 
distributed mini-batch algorithm in the next section, we first analyze a serial mini-batch 
algorithm. 

In the setting described in Algorithm 1, the update rule is applied after each input is 
received. We deviate from this setting and apply the update only periodically. Let $b$ 
be a user-defined batch size (a positive integer), and consider every $b$ consecutive inputs 
as a batch. We define the serial mini-batch algorithm as follows: Our prediction remains 

Algorithm 2: Template for a serial mini-batch algorithm. 

for j = 1, 2, ... do 
    initialize $\bar g_j := 0$; 
    for s = 1, ..., b do 
        define $i := (j - 1)b + s$; 
        predict $w_j$; 
        receive input $z_i$ sampled i.i.d. from unknown distribution; 
        suffer loss $f(w_j, z_i)$; 
        $g_i := \nabla_w f(w_j, z_i)$; 
        $\bar g_j := \bar g_j + (1/b) g_i$; 
    end 
    set $(w_{j+1}, a_{j+1}) = \phi(a_j, \bar g_j, \alpha_j)$; 
end 



constant for the duration of each batch, and is updated only when a batch ends. While 
processing the $b$ inputs in batch $j$, the algorithm calculates and accumulates gradients and 
defines the average gradient 

$$\bar g_j = \frac{1}{b} \sum_{s=1}^{b} g_{(j-1)b+s}.$$

Hence, each batch of $b$ inputs generates a single average gradient. Once a batch ends, the 
serial mini-batch algorithm feeds $\bar g_j$ to the update rule $\phi$ as the $j$-th gradient and obtains 
the new prediction for the next batch and the new state. See Algorithm 2 for a formal 
definition of the serial mini-batch algorithm. The appeal of the serial mini-batch setting is 
that the update rule is used less frequently, which may have computational benefits. 
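
The serial mini-batch procedure can be sketched as follows. This is an illustrative rendering rather than the authors' code: `run_serial_minibatch` and `sgd_phi` are hypothetical names, the squared loss is an assumed example, and a trailing partial batch is simply ignored.

```python
import numpy as np

def run_serial_minibatch(phi, grad, inputs, w1, a1, alpha, b):
    """Serial mini-batch sketch: hold w_j fixed for b inputs, then update once."""
    w, a = w1, a1
    n_batches = len(inputs) // b              # a trailing partial batch is ignored here
    for j in range(1, n_batches + 1):
        g_bar = np.zeros_like(w1)
        for z in inputs[(j - 1) * b : j * b]:
            g_bar += grad(w, z) / b           # accumulate the average gradient at w_j
        w, a = phi(a, g_bar, alpha(j))        # a single update per batch
    return w, n_batches

# Hypothetical update rule and problem: gradient step on f(w, z) = 0.5 * ||w - z||^2.
def sgd_phi(a, g, alpha_j):
    w_next = a - g / alpha_j
    return w_next, w_next

rng = np.random.default_rng(0)
zs = rng.normal(loc=1.0, scale=0.5, size=(2000, 3))
w_mb, n_updates = run_serial_minibatch(
    sgd_phi, lambda w, z: w - z, zs, np.zeros(3), np.zeros(3),
    alpha=lambda j: 1.0 + np.sqrt(j), b=20,
)
```

With `b=20`, the update rule is invoked 100 times for 2000 inputs, which is the computational saving mentioned above.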

Theorem 3 Let $f(w, z)$ be an $L$-smooth convex loss function in $w$ for each $z \in Z$ and 
assume that the stochastic gradient $\nabla_w f(w, z_i)$ has $\sigma^2$-bounded variance for all $w$. If the 
update rule $\phi$ has the serial regret bound $\psi(\sigma^2, m)$, then the expected regret of Algorithm 2 
over $m$ inputs is at most 

$$b\, \psi\Bigl( \frac{\sigma^2}{b}, \Bigl\lceil \frac{m}{b} \Bigr\rceil \Bigr).$$

If $\psi(\sigma^2, m) = 2D^2 L + 2D\sigma\sqrt{m}$, then the expected regret is bounded by 

$$2bD^2 L + 2D\sigma\sqrt{m + b}.$$

Proof Assume without loss of generality that $b$ divides $m$, and that the serial mini-batch 
algorithm processes exactly $m/b$ complete batches.$^2$ Let $Z^b$ denote the set of all sequences 
of $b$ elements from $Z$, and assume that a sequence is sampled from $Z^b$ by sampling each 
element i.i.d. from $Z$. Let $\bar f : W \times Z^b \to \mathbb{R}$ be defined as 

$$\bar f\bigl(w, (z_1, \ldots, z_b)\bigr) = \frac{1}{b} \sum_{s=1}^{b} f(w, z_s).$$

2. We can make this assumption since if $b$ does not divide $m$ then we can pad the input sequence with 
additional inputs until $m/b = \lceil m/b \rceil$, and the expected regret can only increase. 

In other words, $\bar f$ averages the loss function $f$ across $b$ inputs from $Z$, while keeping the 
prediction constant. It is straightforward to show that $\mathbb{E}_{\bar z \in Z^b} \bar f(w, \bar z) = \mathbb{E}_{z \in Z} f(w, z) = F(w)$. 
Using the linearity of the gradient operator, we have 

$$\nabla_w \bar f\bigl(w, (z_1, \ldots, z_b)\bigr) = \frac{1}{b} \sum_{s=1}^{b} \nabla_w f(w, z_s).$$

Let $\bar z_j$ denote the sequence $(z_{(j-1)b+1}, \ldots, z_{jb})$, namely, the sequence of $b$ inputs in batch $j$. 
The vector $\bar g_j$ in Algorithm 2 is precisely the gradient of $\bar f(\cdot, \bar z_j)$ evaluated at $w_j$. 
Therefore the serial mini-batch algorithm is equivalent to using the update rule $\phi$ with the loss 
function $\bar f$. 

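Next we check the properties of $\bar f(w, \bar z)$ against the two assumptions in Section 2.1. 
First, if $f$ is $L$-smooth then $\bar f$ is $L$-smooth as well due to the triangle inequality. Then we 
analyze the variance of the stochastic gradient. Using the properties of the Euclidean norm, 
we can write 

$$\bigl\| \nabla_w \bar f(w, \bar z) - \nabla F(w) \bigr\|^2 = \Bigl\| \frac{1}{b} \sum_{s=1}^{b} \bigl( \nabla_w f(w, z_s) - \nabla F(w) \bigr) \Bigr\|^2 = \frac{1}{b^2} \sum_{s=1}^{b} \sum_{s'=1}^{b} \bigl\langle \nabla_w f(w, z_s) - \nabla F(w), \, \nabla_w f(w, z_{s'}) - \nabla F(w) \bigr\rangle.$$

Notice that $z_s$ and $z_{s'}$ are independent whenever $s \ne s'$, and in such cases, 

$$\mathbb{E} \bigl\langle \nabla_w f(w, z_s) - \nabla F(w), \, \nabla_w f(w, z_{s'}) - \nabla F(w) \bigr\rangle = \bigl\langle \mathbb{E}[\nabla_w f(w, z_s) - \nabla F(w)], \, \mathbb{E}[\nabla_w f(w, z_{s'}) - \nabla F(w)] \bigr\rangle = 0.$$

Therefore, we have for every $w \in W$, 

$$\mathbb{E} \bigl\| \nabla_w \bar f(w, \bar z) - \nabla F(w) \bigr\|^2 = \frac{1}{b^2} \sum_{s=1}^{b} \mathbb{E} \bigl\| \nabla_w f(w, z_s) - \nabla F(w) \bigr\|^2 \le \frac{\sigma^2}{b}. \tag{7}$$

So we conclude that $\nabla_w \bar f(w, \bar z_j)$ has a $(\sigma^2/b)$-bounded variance for each $j$ and each $w \in W$. 
If the update rule $\phi$ has a regret bound $\psi(\sigma^2, m)$ for the loss function $f$ over $m$ inputs, then 
its regret for $\bar f$ over $m/b$ batches is bounded as 

$$\mathbb{E} \Bigl[ \sum_{j=1}^{m/b} \bigl( \bar f(w_j, \bar z_j) - \bar f(w^*, \bar z_j) \bigr) \Bigr] \le \psi\Bigl( \frac{\sigma^2}{b}, \frac{m}{b} \Bigr).$$

By replacing $\bar f$ above with its definition, and multiplying both sides of the above inequality 
by $b$, we have 

$$\mathbb{E} \Bigl[ \sum_{j=1}^{m/b} \sum_{i=(j-1)b+1}^{jb} \bigl( f(w_j, z_i) - f(w^*, z_i) \bigr) \Bigr] \le b\, \psi\Bigl( \frac{\sigma^2}{b}, \frac{m}{b} \Bigr).$$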

If $\psi(\sigma^2, m) = 2D^2 L + 2D\sigma\sqrt{m}$, then simply plugging in the general bound $b\,\psi(\sigma^2/b, \lceil m/b \rceil)$ 
and using $\lceil m/b \rceil \le m/b + 1$ gives the desired result. However, we note that the optimal 
algorithmic parameters, as specified in Theorem 1 and Theorem 2, must be changed to 
$\alpha_j = L + (\sigma / (\sqrt{b} D))\sqrt{j}$ to reflect the reduced variance $\sigma^2/b$ in the mini-batch setting. ■ 
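
The variance reduction in Equation (7) can also be checked empirically. The sketch below is a Monte Carlo illustration under an assumed noise model (isotropic Gaussian gradient noise with total variance $\sigma^2$); the model and all variable names are assumptions of the sketch, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, b, trials, dim = 2.0, 16, 20000, 5

# Assumed noise model: grad f(w, z) - grad F(w) ~ N(0, (sigma^2 / dim) I),
# so that E||noise||^2 = sigma^2 for a single input.
noise = rng.normal(scale=sigma / np.sqrt(dim), size=(trials, b, dim))

# Squared deviation of a single stochastic gradient from grad F(w).
single_var = np.mean(np.sum(noise[:, 0, :] ** 2, axis=1))
# Squared deviation of the average of b independent stochastic gradients.
batch_var = np.mean(np.sum(noise.mean(axis=1) ** 2, axis=1))
```

Under this model `single_var` estimates $\sigma^2$ and `batch_var` estimates $\sigma^2/b$, matching the factor of $b$ in Equation (7).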

The bound in Theorem 3 is asymptotically equivalent to the $2D^2 L + 2D\sigma\sqrt{m}$ regret 
bound for the basic serial algorithms presented in Section 2. In other words, performing 
the mini-batch update in the serial setting does not significantly hurt the performance of 
the update rule. On the other hand, it is also not surprising that using mini-batches in 
the serial setting does not improve the regret bound. After all, it is still a serial algorithm, 
and the bounds we presented in Section 2.1 are optimal. Nevertheless, our experiments 
demonstrate that in real-world scenarios, mini-batching can in fact have a very substantial 
positive effect on the transient performance of the online prediction algorithm, even in the 
serial setting (see Section 6 for details). Such positive effects are not captured by our 
asymptotic, worst-case analysis. 

4. Distributed Mini-Batch for Stochastic Online Prediction 

In this section, we show that in a distributed setting, the mini-batch idea can be exploited 
to obtain nearly optimal regret bounds. To make our setting as realistic as possible, we 
assume that any communication over the network incurs a latency. More specifically, we 
view the network as an undirected graph $\mathcal{G}$ over the set of nodes, where each edge represents 
a bi-directional network link. If nodes $u$ and $v$ are not connected by a link, then any 
communication between them must be relayed through other nodes. The latency incurred 
between $u$ and $v$ is therefore proportional to the graph distance between them, and the 
longest possible latency is thus proportional to the diameter of $\mathcal{G}$. 

In addition to latency, we assume that the network has limited bandwidth. However, 
we would like to avoid the tedious discussion of data representation, compression schemes, 
error correcting, packet sizes, etc. Therefore, we do not explicitly quantify the bandwidth 
of the network. Instead, we require that the communication load at each node remains 
constant, and does not grow with the number of nodes k or with the rate at which the 
incoming functions arrive. 

Although we are free to use any communication model that respects the constraints of 
our network, we assume only the availability of a distributed vector-sum operation. This is 
a standard$^3$ synchronized network operation. Each vector-sum operation begins with each 
node holding a vector $v_j$, and ends with each node holding the sum $\sum_{j=1}^{k} v_j$. This operation 
transmits messages along a rooted minimum-depth spanning tree of $\mathcal{G}$, which we denote by 
$T$: first the leaves of $T$ send their vectors to their parents; each parent sums the vectors 
received from its children and adds its own vector; the parent then sends the result to its 
own parent, and so forth; ultimately the sum of all vectors reaches the tree root; finally, the 
root broadcasts the overall sum down the tree to all of the nodes. 

An elegant property of the vector-sum operation is that it uses each up-link and each 
down-link in T exactly once. This allows us to start vector-sum operations back-to-back. 

3. For example, all-reduce with the sum operation is a standard operation in MPI. 
These vector-sum operations will run concurrently without creating network congestion 
on any edge of $T$. Furthermore, we assume that the network operations are non-blocking, 
meaning that each node can continue processing incoming inputs while the vector-sum 
operation takes place in the background. This is a key property that allows us to efficiently deal 
with network latency. To formalize how latency affects the performance of our algorithm, 
let $\mu$ denote the number of inputs that are processed by the entire system during the period 
of time it takes to complete a vector-sum operation across the entire network. Usually $\mu$ 
scales linearly with the diameter of the network, or (for appropriate network architectures) 
logarithmically in the number of nodes $k$. 
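
The reduce-then-broadcast pattern of the vector-sum operation can be simulated in a single process. The sketch below only mirrors the data flow over a rooted tree; it does not model latency, concurrency, or a real network library, and the function name and the five-node tree are hypothetical.

```python
import numpy as np

def tree_vector_sum(vectors, children, root=0):
    """Simulate vector-sum on a rooted tree: reduce up to the root, then broadcast down."""
    def reduce_up(node):
        # Each node adds its own vector to the partial sums received from its children.
        total = vectors[node].copy()
        for child in children.get(node, []):
            total += reduce_up(child)
        return total

    overall = reduce_up(root)
    # Broadcast step: after the operation, every node holds the overall sum.
    return {node: overall.copy() for node in vectors}

# Hypothetical 5-node tree: node 0 is the root, 1 and 2 are its children,
# and 3 and 4 are children of node 1.
children = {0: [1, 2], 1: [3, 4]}
vectors = {i: np.array([float(i), 1.0]) for i in range(5)}
result = tree_vector_sum(vectors, children)
```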

4.1 The DMB Algorithm 

We are now ready to present a general technique for applying a deterministic update rule $\phi$ 
in a distributed environment. This technique resembles the serial mini-batch technique 
described earlier, and is therefore called the distributed mini-batch algorithm, or DMB for 
short. 

Algorithm 3 describes a template of the DMB algorithm that runs in parallel on each 
node in the network, and Figure 1 illustrates the overall algorithm work-flow. Again, let $b$ 
be a batch size, which we will specify later on, and for simplicity assume that $k$ divides $b$ 
and $\mu$. The DMB algorithm processes the input stream in batches $j = 1, 2, \ldots$, where 
each batch contains $b + \mu$ consecutive inputs. During each batch $j$, all of the nodes use a 
common predictor $w_j$. While observing the first $b$ inputs in a batch, the nodes calculate 
and accumulate the stochastic gradients of the loss function $f$ at $w_j$. Once the nodes 
have accumulated $b$ gradients altogether, they start a distributed vector-sum operation 
to calculate the sum of these $b$ gradients. While the vector-sum operation completes in 
the background, $\mu$ additional inputs arrive (roughly $\mu/k$ per node) and the system keeps 
processing them using the same predictor $w_j$. The gradients of these additional $\mu$ inputs 
are discarded (so they do not need to be computed). Although this may seem 
wasteful, we show that this waste can be made negligible by choosing $b$ appropriately. 

Once the vector-sum operation completes, each node holds the sum of the $b$ gradients 
collected during batch $j$. Each node divides this sum by $b$ and obtains the average 
gradient, which we denote by $\bar g_j$. Each node feeds this average gradient to the update rule $\phi$, 
which returns a new synchronized prediction $w_{j+1}$. In summary, during batch $j$ each node 
processes $(b + \mu)/k$ inputs using the same predictor $w_j$, but only the first $b/k$ gradients are 
used to compute the next predictor. Nevertheless, all inputs are counted in our regret 
calculation. 
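
The batch structure described above can be illustrated with a single-process simulation. This sketch is not a distributed implementation: the $k$ nodes are simulated by slicing each batch, the vector-sum is replaced by an ordinary sum, and the names `run_dmb` and `sgd_phi` and the example problem are assumptions made for the illustration.

```python
import numpy as np

def run_dmb(phi, grad, inputs, w1, a1, alpha, b, mu, k):
    """Single-process simulation of DMB; assumes k divides both b and mu."""
    w, a = w1, a1
    n_batches = len(inputs) // (b + mu)
    for j in range(1, n_batches + 1):
        start = (j - 1) * (b + mu)
        batch = inputs[start : start + b]            # first b inputs of batch j
        # Each simulated node sums the gradients of its b/k share of the batch.
        local_sums = [sum(grad(w, z) for z in batch[n::k]) for n in range(k)]
        g_bar = np.sum(local_sums, axis=0) / b       # stand-in for vector-sum, then average
        # The remaining mu inputs of the batch are predicted with w_j,
        # but their gradients are never computed.
        w, a = phi(a, g_bar, alpha(j))
    return w, n_batches

# Hypothetical update rule and problem: gradient step on f(w, z) = 0.5 * ||w - z||^2.
def sgd_phi(a, g, alpha_j):
    w_next = a - g / alpha_j
    return w_next, w_next

rng = np.random.default_rng(0)
zs = rng.normal(loc=1.0, scale=0.5, size=(2400, 3))
w_dmb, n_updates = run_dmb(
    sgd_phi, lambda w, z: w - z, zs, np.zeros(3), np.zeros(3),
    alpha=lambda j: 1.0 + np.sqrt(j), b=32, mu=8, k=4,
)
```

Because every simulated node applies the same deterministic `phi` to the same averaged gradient, all nodes would hold the same predictor after each batch.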

If the network operations are conducted over a spanning tree, then an obvious variant 
of the DMB algorithm is to let the root apply the update rule to get the next predictor, and 
then broadcast it to all other nodes. This saves repeated executions of the update rule at 
each node (but requires interruption or modification of the standard vector-sum operations 
in the network communication model). Moreover, this guarantees that all the nodes have the 
same predictor even with update rules that depend on some random bits. 

Theorem 4 Let $f(w, z)$ be an $L$-smooth convex loss function in $w$ for each $z \in Z$ and 
assume that the stochastic gradient $\nabla_w f(w, z_i)$ has $\sigma^2$-bounded variance for all $w \in W$. If 
the update rule $\phi$ has the serial regret bound $\psi(\sigma^2, m)$, then the expected regret of Algorithm 3 

Algorithm 3: Distributed mini-batch (DMB) algorithm (running on each node). 

for j = 1, 2, ... do 
    initialize $\hat g_j := 0$; 
    for s = 1, ..., b/k do 
        predict $w_j$; 
        receive input $z$ sampled i.i.d. from unknown distribution; 
        suffer loss $f(w_j, z)$; 
        compute $g := \nabla_w f(w_j, z)$; 
        $\hat g_j := \hat g_j + g$; 
    end 
    call the distributed vector-sum to compute the sum of $\hat g_j$ across all nodes; 
    receive $\mu/k$ additional inputs and continue predicting using $w_j$; 
    finish vector-sum and compute average gradient $\bar g_j$ by dividing the sum by $b$; 
    set $(w_{j+1}, a_{j+1}) = \phi(a_j, \bar g_j, \alpha_j)$; 
end 



1 2 •■• k 




Figure 1: Work flow of the DMB algorithm. Within each batch j = 1,2,..., each node 
accumulates the stochastic gradients of the first b/k inputs. Then a vector-sum 
operation across the network is used to compute the average across all nodes. 
While the vector-sum operation completes in the background, a total of ^ inputs 
are processed by the processors using the same predictor Wj, but their gradients 
are not collected. Once all of the nodes have the overall average gj, each node 
updates the predictor using the same deterministic serial algorithm. 



176 



Optimal Distributed Online Prediction 



over m samples is at most 



(6 + ^)V 





m 




6 + 



Specifically, if ip{a'^ ,m) = 2D'^L + 2Da^/rn, then setting the batch size b = ml'^ gives the 
expected regret bound 

2Da^/^ + 2Dni/'' {LD + a,/Ji) + 2Darri/'- + 2Dafim~'/''' + 2fiD^L. (8) 

In fact, if b = for any p € (0, 1/2), the expected regret bound is 2Da^/rn + o{y/rn). 

To appreciate the power of this result, we compare the specific bound in Equation (8)
with the ideal serial solution and the naive no-communication solution discussed in the
introduction. It is clear that our bound is asymptotically equivalent to the ideal serial
bound $\psi(\sigma^2, m)$; even the constants in the dominant terms are identical. Our bound scales
nicely with the network latency and the cluster size $k$, because $\mu$ (which usually scales
logarithmically with $k$) does not appear in the dominant $\sqrt{m}$ term. On the other hand, the
naive no-communication solution has regret bounded by $k\,\psi(\sigma^2, m/k) = 2kD^2L + 2D\sigma\sqrt{km}$
(see Section 2.2). If $1 \ll k \ll m$, this bound is worse than the bound in Theorem 4 by a
factor of $\sqrt{k}$.
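The comparison can be made concrete with a few lines of arithmetic. The following is a minimal sketch, assuming the serial bound $\psi(\sigma^2, m) = 2D^2L + 2D\sigma\sqrt{m}$, the DMB bound $(b+\mu)\,\psi(\sigma^2/b, \lceil m/(b+\mu) \rceil)$ and the no-communication bound $k\,\psi(\sigma^2, m/k)$; the function names and the values $D = L = \sigma = 1$, $m = 10^9$, $k = 1024$, $\mu = 40$ are illustrative assumptions.

```python
import math


def psi(sigma2, m, D=1.0, L=1.0):
    """Serial regret bound psi(sigma^2, m) = 2 D^2 L + 2 D sigma sqrt(m)."""
    return 2 * D**2 * L + 2 * D * math.sqrt(sigma2) * math.sqrt(m)


def dmb_bound(sigma2, m, b, mu, D=1.0, L=1.0):
    """DMB bound (b + mu) * psi(sigma^2 / b, ceil(m / (b + mu)))."""
    batches = math.ceil(m / (b + mu))
    return (b + mu) * psi(sigma2 / b, batches, D, L)


def no_comm_bound(sigma2, m, k, D=1.0, L=1.0):
    """k independent learners with m / k inputs each: k * psi(sigma^2, m / k)."""
    return k * psi(sigma2, m / k, D, L)


m, k, mu = 10**9, 1024, 40
b = round(m ** (1 / 3))
serial = psi(1.0, m)
dmb = dmb_bound(1.0, m, b, mu)
naive = no_comm_bound(1.0, m, k)
print(f"serial={serial:.0f}  dmb={dmb:.0f}  no-comm={naive:.0f}")
print(f"dmb/serial={dmb / serial:.3f}  no-comm/serial={naive / serial:.2f}")
```

With these assumed values the DMB bound stays within a few percent of the serial bound, while the no-communication bound is larger by a factor close to $\sqrt{k} = 32$.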

Finally, we note that choosing $b$ as $m^{\rho}$ for an appropriate $\rho$ requires knowledge of $m$ in
advance. However, this requirement can be relaxed by applying a standard doubling trick
(Cesa-Bianchi and Lugosi, 2006). This gives a single algorithm that does not take $m$ as
input, with asymptotically similar regret. If we use a fixed $b$ regardless of $m$, the dominant
term of the regret bound becomes $2D\sigma\sqrt{\log(k)\, m / b}$; see the following proof for details.
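One common form of the doubling trick can be sketched as follows. This is an assumption about how the trick might be instantiated here, not a procedure spelled out in the text: the stream is cut into epochs whose guessed lengths double, and each epoch restarts the algorithm with a batch size tuned to that guess. The names `doubling_schedule`, `first_epoch` and `rho` are hypothetical.

```python
def doubling_schedule(total_inputs, first_epoch=1024, rho=1.0 / 3.0):
    """Yield (epoch_length, batch_size) pairs that cover total_inputs inputs.

    Each epoch assumes a horizon twice as long as the previous one and
    uses the batch size b = horizon ** rho, rounded to an integer.
    """
    remaining = total_inputs
    horizon = first_epoch
    while remaining > 0:
        length = min(horizon, remaining)       # the last epoch may be cut short
        batch = max(1, round(horizon ** rho))
        yield length, batch
        remaining -= length
        horizon *= 2


for length, batch in doubling_schedule(10_000):
    print(f"epoch of {length} inputs, batch size {batch}")
```

The caller never supplies the true stream length to the learner; only the guessed horizon of the current epoch determines the batch size.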

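Proof Similar to the proof of Theorem 3, we assume without loss of generality that $k$
divides $b + \mu$, we define the function $\bar{f} : W \times \mathcal{Z}^b \to \mathbb{R}$ as

$$\bar{f}\left(w, (z_1, \ldots, z_b)\right) = \frac{1}{b} \sum_{s=1}^{b} f(w, z_s),$$

and we use $\bar{z}_j$ to denote the first $b$ inputs in batch $j$. By construction, the function $\bar{f}$ is
$L$-smooth and its gradients have $\sigma^2/b$-bounded variance. The average gradient $\bar{g}_j$ computed
by the DMB algorithm is the gradient of $\bar{f}(\cdot, \bar{z}_j)$ evaluated at the point $w_j$. Therefore,

$$\mathbb{E}\left[\sum_{j=1}^{\lceil m/(b+\mu) \rceil} \left(\bar{f}(w_j, \bar{z}_j) - \bar{f}(w^*, \bar{z}_j)\right)\right] \le \psi\!\left(\frac{\sigma^2}{b}, \left\lceil \frac{m}{b+\mu} \right\rceil\right). \tag{9}$$

This inequality only involves the additional $\mu$ examples in counting the number of batches
as $\lceil m/(b+\mu) \rceil$. In order to count them in the total regret, we notice that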



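$$\forall j, \quad \mathbb{E}\left[\bar{f}(w_j, \bar{z}_j) \mid w_j\right] = \mathbb{E}\left[\frac{1}{b+\mu} \sum_{i=(j-1)(b+\mu)+1}^{j(b+\mu)} f(w_j, z_i) \;\middle|\; w_j\right],$$

and a similar equality holds for $\bar{f}(w^*, \bar{z}_j)$. Substituting these equalities in the left-hand side
of Equation (9) and multiplying both sides by $b + \mu$ yields

$$\mathbb{E}\left[\sum_{j=1}^{\lceil m/(b+\mu) \rceil} \; \sum_{i=(j-1)(b+\mu)+1}^{j(b+\mu)} \left(f(w_j, z_i) - f(w^*, z_i)\right)\right] \le (b+\mu)\, \psi\!\left(\frac{\sigma^2}{b}, \left\lceil \frac{m}{b+\mu} \right\rceil\right).$$

Again, if $(b + \mu)$ divides $m$, then the left-hand side above is exactly the expected regret of
the DMB algorithm over $m$ examples. Otherwise, the expected regret can only be smaller.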

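For the concrete case of $\psi(\sigma^2, m) = 2D^2L + 2D\sigma\sqrt{m}$, plugging in the new values for $\sigma^2$
and $m$ results in a bound of the form

$$(b+\mu)\, \psi\!\left(\frac{\sigma^2}{b}, \left\lceil \frac{m}{b+\mu} \right\rceil\right) \le 2(b+\mu)D^2L + 2D\sigma\sqrt{m + \frac{\mu}{b}\, m + \frac{(b+\mu)^2}{b}}.$$

Using the inequality $\sqrt{x + y + z} \le \sqrt{x} + \sqrt{y} + \sqrt{z}$, which holds for any nonnegative numbers
$x$, $y$ and $z$, we bound the expression above by

$$2(b+\mu)D^2L + 2D\sigma\sqrt{m} + 2D\sigma\sqrt{\frac{\mu m}{b}} + 2D\sigma\, \frac{b+\mu}{\sqrt{b}}.$$

It is clear that with $b = C m^{\rho}$ for any $\rho \in (0, 1/2)$ and any constant $C > 0$, this bound
can be written as $2D\sigma\sqrt{m} + o(\sqrt{m})$. Letting $b = m^{1/3}$ gives the smallest exponents in the
$o(\sqrt{m})$ terms. ■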
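As a worked check, the specific bound in Equation (8) can be recovered by substituting $b = m^{1/3}$ into the four-term bound $2(b+\mu)D^2L + 2D\sigma\sqrt{m} + 2D\sigma\sqrt{\mu m / b} + 2D\sigma (b+\mu)/\sqrt{b}$, term by term:

```latex
\begin{align*}
2(b+\mu)D^2L &= 2m^{1/3}D^2L + 2\mu D^2L, \\
2D\sigma\sqrt{\mu m / b} &= 2D\sigma\sqrt{\mu}\, m^{1/3}, \\
2D\sigma\,\frac{b+\mu}{\sqrt{b}} &= 2D\sigma\, m^{1/6} + 2D\sigma\mu\, m^{-1/6}.
\end{align*}
% Adding the untouched term 2 D \sigma \sqrt{m} and grouping the m^{1/3} terms:
\[
2D\sigma\sqrt{m} + 2D\,m^{1/3}\bigl(LD + \sigma\sqrt{\mu}\bigr)
  + 2D\sigma\, m^{1/6} + 2D\sigma\mu\, m^{-1/6} + 2\mu D^2L .
\]
```

Every term other than $2D\sigma\sqrt{m}$ has an exponent of at most $1/3$ in $m$, which is why the remainder is $o(\sqrt{m})$.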



In the proofs of Theorem 3 and Theorem 4, decreasing the variance by a factor of $b$, as
given in Equation (7), relies on properties of the Euclidean norm. For serial gradient-type
algorithms that are specified with different norms (see the general framework in Appendix A),
the variance does not typically decrease as much. For example, in the dual averaging method
specified in Equation (3), if we use $h(w) = \frac{1}{2(p-1)} \|w\|_p^2$ for some $p \in (1, 2]$, then the
"variance" bounds for the stochastic gradients must be expressed in the dual norm, that
is, $\mathbb{E}\, \|\nabla_w f(w, z) - \nabla F(w)\|_q^2 \le \sigma^2$, where $q = p/(p-1) \in [2, \infty)$. In this case, the variance
bound for the averaged function becomes

$$\mathbb{E}\, \|\nabla_w \bar{f}(w, \bar{z}) - \nabla F(w)\|_q^2 \le C(n, q)\, \frac{\sigma^2}{b},$$

where $C(n, q) = \min\{q - 1, O(\log(n))\}$ is a space-dependent constant.⁴ Nevertheless, we
can still obtain a linear reduction in $b$ even for such non-Euclidean norms. The net effect is
that the regret bound for the DMB algorithm becomes $2D\sqrt{C(n, q)}\, \sigma\sqrt{m} + o(\sqrt{m})$.



4. For further details of algorithms using the p-norm, see Xiao (2010, Section 7.2) and
Shalev-Shwartz and Tewari (2011). For the derivation of $C(n, q)$ see for instance Lemma B.2 in
Cotter et al. (2011).
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4.2 Improving Performance on Short Input Streams 

Theorem 4 presents an optimal way of choosing the batch size $b$, which results in an
asymptotically optimal regret bound. However, our asymptotic approach hides a potential
shortcoming that occurs when $m$ is small. Say that we know, ahead of time, that the sequence
length is $m = 15{,}000$. Moreover, say that the latency is $\mu = 100$, and that $\sigma = 1$ and
$L = 1$. In this case, Theorem 4 determines that the optimal batch size is $b \approx 25$. In other
words, for every 25 inputs that participate in the update, 100 inputs are discarded. This
waste becomes negligible as $b$ grows with $m$ and does not affect our asymptotic analysis.
However, if $m$ is known to be small, we can take steps to improve the situation.

Assume for simplicity that $b$ divides $\mu$. Now, instead of running a single distributed
mini-batch algorithm, we run $c = 1 + \mu/b$ independent interlaced instances of the distributed
mini-batch algorithm on each node. At any given moment, $c - 1$ instances are asleep and
one instance is active. Once the active instance collects $b/k$ gradients on each node, it starts
a vector-sum network operation, awakens the next instance, and puts itself to sleep. Note
that each instance awakens after $(c - 1)b = \mu$ inputs, which is just in time for its vector-sum
operation to complete.
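The interlacing schedule can be checked with a short sketch. It assumes that inputs are indexed from 0 and that instances take turns in round-robin order; the helper names `active_instance` and `wake_times` are hypothetical.

```python
def active_instance(t, b, mu):
    """Index of the interlaced instance that is awake for input t (0-based)."""
    assert mu % b == 0, "sketch assumes b divides mu"
    c = 1 + mu // b                  # number of interlaced instances
    return (t // b) % c


def wake_times(instance, b, mu, horizon):
    """Input indices below horizon at which the given instance becomes active."""
    assert mu % b == 0, "sketch assumes b divides mu"
    c = 1 + mu // b
    return list(range(instance * b, horizon, c * b))


b, mu = 25, 100                      # the example values m = 15,000 gives b of about 25
wakes = wake_times(0, b, mu, 400)
print(wakes)
# Inputs that pass between the end of one active period and the next wake-up.
print([nxt - (cur + b) for cur, nxt in zip(wakes, wakes[1:])])
```

With $b = 25$ and $\mu = 100$ there are $c = 5$ instances, and each one sleeps for exactly $\mu$ inputs between two of its active periods.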

In the setting described above, c different vector-sum operations propagate concurrently 
through the network. The distributed vector sum operation is typically designed such that 
each network link is used at most once in each direction, so concurrent sum operations that 
begin at different times should not compete for network resources. The batch size should 
indeed be set such that the generated traffic does not exceed the network bandwidth limit, 
but the latency of each sum operation should not be affected by the fact that multiple sum 
operations take place at once. 

Simply interlacing c independent copies of our algorithm does not resolve the afore- 
mentioned problem, since each prediction is still defined by 1/c of the observed inputs. 
Therefore, instead of using the predictions prescribed by the individual online predictors, 
we use their average. Namely, we take the most recent prediction generated by each in- 
stance, average these predictions, and use this average in place of the original prediction. 

The advantages of this modification are not apparent from our theoretical analysis. Each
instance of the algorithm handles $m/c$ inputs and suffers a regret of at most

$$b\, \psi\!\left(\frac{\sigma^2}{b}, \left\lceil \frac{m}{bc} \right\rceil\right),$$

and, using Jensen's inequality, the overall regret using the average prediction is upper
bounded by

$$bc\, \psi\!\left(\frac{\sigma^2}{b}, \left\lceil \frac{m}{bc} \right\rceil\right).$$

Since $bc = b + \mu$, the bound above is precisely the same as the bound in Theorem 4. Despite
this fact, we conjecture that this method will indeed improve empirical results when the batch
size $b$ is small compared to the latency term $\mu$.

5. Stochastic Optimization 

As we discussed in the introduction, the stochastic optimization problem is closely related, 
but not identical, to the stochastic online prediction problem. In both cases, there is a loss
function $f(w, z)$ to be minimized. The difference is in the way success is measured. In online
prediction, success is measured by regret, which is the difference between the cumulative
loss suffered by the prediction algorithm and the cumulative loss of the best fixed predictor.
The goal of stochastic optimization is to find an approximate solution to the problem

$$\underset{w \in W}{\text{minimize}} \quad F(w) = \mathbb{E}_z\left[f(w, z)\right],$$

and success is measured by the difference between the expected loss of the final output of
the optimization algorithm and the expected loss of the true minimizer $w^*$. As before, we
assume that the loss function $f(w, z)$ is convex in $w$ for any $z \in \mathcal{Z}$, and that $W$ is a closed
convex set.

We consider the same stochastic approximation type of algorithms presented in
Algorithm 1, and define the final output of the algorithm, after processing $m$ i.i.d. samples, to
be

$$\bar{w}_m = \frac{1}{m} \sum_{j=1}^{m} w_j.$$

In this case, the appropriate measure of success is the optimality gap

$$G(m) = F(\bar{w}_m) - F(w^*).$$

Notice that the optimality gap $G(m)$ is also a random variable, because $\bar{w}_m$ depends on
the random samples $z_1, \ldots, z_m$. It can be shown (see, e.g., Xiao, 2010, Theorem 3) that for
convex loss functions and i.i.d. inputs, we always have

$$\mathbb{E}[G(m)] \le \frac{1}{m}\, \mathbb{E}[R(m)].$$

Therefore, a bound on the expected optimality gap can be readily obtained from a bound
on the expected regret of the same algorithm. In particular, if $f$ is an $L$-smooth convex loss
function and $\nabla_w f(w, z)$ has $\sigma^2$-bounded variance, and our algorithm has a regret bound of
$\psi(\sigma^2, m)$, then it also has an expected optimality gap of at most

$$\bar{\psi}(\sigma^2, m) = \frac{1}{m}\, \psi(\sigma^2, m).$$

For the specific regret bound $\psi(\sigma^2, m) = 2D^2L + 2D\sigma\sqrt{m}$, which holds for the serial
algorithms presented in Section 2, we have

$$\mathbb{E}[G(m)] \le \bar{\psi}(\sigma^2, m) = \frac{2D^2L}{m} + \frac{2D\sigma}{\sqrt{m}}.$$


5.1 Stochastic Optimization using Distributed Mini-Batches 

Our template of a DMB algorithm for stochastic optimization (see Algorithm 4) is very
similar to the one presented for the online prediction setting. The main difference is that
we do not have to process inputs while waiting for the vector-sum network operation to
complete. Again let $b$ be the batch size, and the number of batches $r = m/b$. For simplicity
of discussion, we assume that $b$ divides $m$.
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Algorithm 4: Template of DMB algorithm for stochastic optimization. 

for j = 1, 2, ..., r do
    reset $\hat{g}_j = 0$;
    for s = 1, ..., b/k do
        receive input $z_s$ sampled i.i.d. from unknown distribution;
        calculate $g_s = \nabla_w f(w_j, z_s)$;
        calculate $\hat{g}_j \leftarrow \hat{g}_j + g_s$;
    end
    start distributed vector sum to compute the sum of $\hat{g}_j$ across all nodes;
    finish distributed vector sum and compute average gradient $\bar{g}_j$;
    set $(w_{j+1}, a_{j+1}) = \phi(a_j, \bar{g}_j, j)$;
end

Output: $\frac{1}{r} \sum_{j=1}^{r} w_j$



Theorem 5 Let $f(w, z)$ be an $L$-smooth convex loss function in $w$ for each $z \in \mathcal{Z}$ and
assume that the stochastic gradient $\nabla_w f(w, z)$ has $\sigma^2$-bounded variance for all $w \in W$. If
the update rule $\phi$ used in a serial setting has an expected optimality gap bounded by $\bar{\psi}(\sigma^2, m)$,
then the expected optimality gap of Algorithm 4 after processing $m$ samples is at most

$$\bar{\psi}\!\left(\frac{\sigma^2}{b}, \frac{m}{b}\right).$$

If $\bar{\psi}(\sigma^2, m) = \frac{2D^2L}{m} + \frac{2D\sigma}{\sqrt{m}}$, then the expected optimality gap is bounded by

$$\frac{2bD^2L}{m} + \frac{2D\sigma}{\sqrt{m}}.$$

The proof of the theorem follows along the lines of Theorem 3 and is omitted.
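The substitution behind this bound is easy to verify numerically. The sketch below assumes the serial optimality-gap bound $2D^2L/m + 2D\sigma/\sqrt{m}$ and evaluates it with variance $\sigma^2/b$ and $m/b$ update steps; the function names and the values $D = L = \sigma = 1$, $m = 10^6$ are illustrative assumptions.

```python
import math


def gap_bound(sigma2, m, D=1.0, L=1.0):
    """Serial optimality-gap bound: 2 D^2 L / m + 2 D sigma / sqrt(m)."""
    return 2 * D**2 * L / m + 2 * D * math.sqrt(sigma2) / math.sqrt(m)


def dmb_gap_bound(sigma2, m, b, D=1.0, L=1.0):
    """Mini-batched bound: the serial bound with variance sigma^2 / b and m / b steps."""
    return gap_bound(sigma2 / b, m / b, D, L)


m = 10**6
for b in (1, 10, 100, 1000):
    closed_form = 2 * b / m + 2 / math.sqrt(m)   # 2 b D^2 L / m + 2 D sigma / sqrt(m)
    print(b, dmb_gap_bound(1.0, m, b), closed_form)
```

Only the first term grows with $b$; the $2D\sigma/\sqrt{m}$ term is unchanged, because the factor $1/\sqrt{b}$ gained from the smaller variance cancels the factor $\sqrt{b}$ lost from taking fewer steps.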

We comment that the accelerated stochastic gradient methods of Lan (2009), Hu et al.
(2009) and Xiao (2010) can also fit in our template for the DMB algorithm, but with
more sophisticated updating rules. These accelerated methods have an expected optimality
bound of $\bar{\psi}(\sigma^2, m) = 4D^2L/m^2 + 4D\sigma/\sqrt{m}$, which translates into the following bound for the
DMB algorithm:

$$\bar{\psi}\!\left(\frac{\sigma^2}{b}, \frac{m}{b}\right) = \frac{4b^2D^2L}{m^2} + \frac{4D\sigma}{\sqrt{m}}.$$

Most recently, Ghadimi and Lan (2010) developed accelerated stochastic gradient methods
for strongly convex functions that have the convergence rate $\bar{\psi}(\sigma^2, m) = O(1)\left(L/m^2 + \sigma^2/(\nu m)\right)$,
where $\nu$ is the strong convexity parameter of the loss function. The corresponding DMB
algorithm has a convergence rate

$$\bar{\psi}\!\left(\frac{\sigma^2}{b}, \frac{m}{b}\right) = O(1)\left(\frac{b^2 L}{m^2} + \frac{\sigma^2}{\nu m}\right).$$

Apparently, this also fits in the DMB algorithm nicely.

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The significance of our result is that the dominating factor in the convergence rate is 
not affected by the batch size. Therefore, depending on the value of m, we can use large 
batch sizes without affecting the convergence rate in a significant way. Since we can run the 
workload associated with a single batch in parallel, this theorem shows that the mini-batch 
technique is capable of turning many serial optimization algorithms into parallel ones. To 
this end, it is important to analyze the speed-up of the parallel algorithms in terms of the 
running time (wall-clock time). 

5.2 Parallel Speed-Up 

Recall that $k$ is the number of parallel computing nodes and $m$ is the total number of i.i.d.
samples to be processed. Let $b(m)$ be the batch size that depends on $m$. We define a
time-unit to be the time it takes a single node to process one sample (including computing the
gradient and updating the predictor). For convenience, let $\delta$ be the latency of the vector-sum
operation in the network (measured in number of time-units).⁵ Then the parallel speed-up
of the DMB algorithm is

$$S(m) = \frac{m}{\frac{m}{b(m)}\left(\frac{b(m)}{k} + \delta\right)} = \frac{k}{1 + \frac{\delta}{b(m)}\, k},$$

where $m/b(m)$ is the number of batches, and $b(m)/k + \delta$ is the wall-clock time taken by $k$
processors to finish one batch in the DMB algorithm. If $b(m)$ increases at a fast enough rate, then
we have $S(m) \to k$ as $m \to \infty$. Therefore, we obtain an asymptotically linear speed-up,
which is the ideal result that one would hope for in parallelizing the optimization process
(see Gustafson, 1988).
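The limit can be illustrated numerically. The sketch below evaluates the speed-up $m / \left((m/b)(b/k + \delta)\right)$, with time measured in units of one sample processed by one node, under the assumed batch-size rule $b(m) = m^{1/3}$; the values $k = 32$ and $\delta = 5$ and the name `speed_up` are illustrative assumptions.

```python
def speed_up(m, k, delta, rho=1.0 / 3.0):
    """Serial time m divided by the parallel wall-clock time of m / b batches."""
    b = m ** rho                                  # assumed batch-size rule b(m) = m**rho
    parallel_time = (m / b) * (b / k + delta)     # each batch costs b / k + delta
    return m / parallel_time


k, delta = 32, 5.0
for m in (1e3, 1e6, 1e9, 1e12, 1e18):
    print(f"m = {m:.0e}: speed-up = {speed_up(m, k, delta):.2f} (ideal {k})")
```

The speed-up is always below $k$ and approaches it as the batch size grows, since the fixed latency $\delta$ is amortized over more and more useful work per batch.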



In the context of stochastic optimization, it is more appropriate to measure the speed-up
with respect to the same optimality gap, not the same amount of samples processed. Let $\epsilon$
be a given target for the expected optimality gap. Let $m_{\mathrm{srl}}(\epsilon)$ be the number of samples that
the serial algorithm needs to reach this target and let $m_{\mathrm{DMB}}(\epsilon)$ be the number of samples
needed by the DMB algorithm. Slightly overloading our notation, we define the parallel
speed-up with respect to the expected optimality gap $\epsilon$ as

$$S(\epsilon) = \frac{m_{\mathrm{srl}}(\epsilon)}{\frac{m_{\mathrm{DMB}}(\epsilon)}{b}\left(\frac{b}{k} + \delta\right)}. \tag{10}$$

In the above definition, we intentionally leave the dependence of $b$ on $m$ unspecified. Indeed,
once we fix the function $b(m)$, we can substitute it into the equation $\bar{\psi}(\sigma^2/b, m/b) = \epsilon$ to solve
for the exact form of $m_{\mathrm{DMB}}(\epsilon)$. As a result, $b$ is also a function of $\epsilon$.

Since both $m_{\mathrm{srl}}(\epsilon)$ and $m_{\mathrm{DMB}}(\epsilon)$ are upper bounds for the actual running times to reach
$\epsilon$-optimality, their ratio $S(\epsilon)$ may not be a precise measure of the speed-up. However, it
is difficult in practice to measure the actual running times of the algorithms in terms of
reaching $\epsilon$-optimality. So we only hope that $S(\epsilon)$ gives a conceptual guide in comparing the
actual performance of the algorithms. The following result shows that if the batch size $b$ is
chosen to be of order $m^{\rho}$ for any $\rho \in (0, 1/2)$, then we still have asymptotic linear speed-up.



5. The relationship between $\delta$ and $\mu$ defined in the online setting (see Section 4) is roughly $\mu = k\delta$.
Theorem 6 Let $f(w, z)$ be an $L$-smooth convex loss function in $w$ for each $z \in \mathcal{Z}$ and
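assume that the stochastic gradient $\nabla_w f(w, z)$ has $\sigma^2$-bounded variance for all $w \in W$.
Suppose the update rule $\phi$ used in the serial setting has an expected optimality gap bounded by
$\bar{\psi}(\sigma^2, m) = \frac{2D^2L}{m} + \frac{2D\sigma}{\sqrt{m}}$. If the batch size in the DMB algorithm is chosen as $b(m) = \Theta(m^{\rho})$,
where $\rho \in (0, 1/2)$, then we have

$$\lim_{\epsilon \to 0} S(\epsilon) = k.$$

Proof By solving the equation

$$\frac{2D^2L}{m} + \frac{2D\sigma}{\sqrt{m}} = \epsilon,$$

we see that the following number of samples is sufficient for the serial algorithm to reach
$\epsilon$-optimality:

$$m_{\mathrm{srl}}(\epsilon) = \frac{4D^2\sigma^2}{\epsilon^2}\, (1 + o(1)).$$
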
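For the DMB algorithm, we use the batch size $b(m) = (\theta\sigma/(DL))\, m^{\rho}$, with some $\theta > 0$, to
obtain the equation

$$\frac{2b(m)D^2L}{m} + \frac{2D\sigma}{\sqrt{m}} = \frac{2D\sigma}{m^{1/2}}\left(1 + \frac{\theta}{m^{1/2-\rho}}\right) = \epsilon. \tag{11}$$

We use $m_{\mathrm{DMB}}(\epsilon)$ to denote the solution of the above equation. Apparently $m_{\mathrm{DMB}}(\epsilon)$ is
a monotone function of $\epsilon$ and $\lim_{\epsilon \to 0} m_{\mathrm{DMB}}(\epsilon) = \infty$. For convenience (with some abuse
of notation), let $b(\epsilon)$ denote $b(m_{\mathrm{DMB}}(\epsilon))$, which is also monotone in $\epsilon$ and satisfies
$\lim_{\epsilon \to 0} b(\epsilon) = \infty$. Moreover, for any batch size $b \ge 1$, we have $m_{\mathrm{DMB}}(\epsilon) \ge m_{\mathrm{srl}}(\epsilon)$.
Therefore, from Equation (10) we get

$$\limsup_{\epsilon \to 0} S(\epsilon) \le \lim_{\epsilon \to 0} \frac{k}{1 + \frac{\delta}{b(\epsilon)}\, k} = k.$$

Next we show $\liminf_{\epsilon \to 0} S(\epsilon) \ge k$. For any $\eta > 0$, let

$$m_{\eta}(\epsilon) = \frac{4D^2\sigma^2 (1 + \eta)^2}{\epsilon^2},$$

which is monotone decreasing in $\epsilon$, and can be seen as the solution to the equation

$$\frac{2D\sigma}{m^{1/2}}\, (1 + \eta) = \epsilon.$$

Comparing this equation with Equation (11), we see that, for any $\eta > 0$, there exists an $\epsilon'$
such that for all $0 < \epsilon \le \epsilon'$, we have $m_{\mathrm{DMB}}(\epsilon) \le m_{\eta}(\epsilon)$. Therefore,

$$\liminf_{\epsilon \to 0} S(\epsilon) \ge \lim_{\epsilon \to 0} \frac{m_{\mathrm{srl}}(\epsilon)}{m_{\eta}(\epsilon)} \cdot \frac{k}{1 + \frac{\delta}{b(\epsilon)}\, k} = \lim_{\epsilon \to 0} \frac{\frac{4D^2\sigma^2}{\epsilon^2}(1 + o(1))}{\frac{4D^2\sigma^2 (1+\eta)^2}{\epsilon^2}} \cdot \frac{k}{1 + \frac{\delta}{b(\epsilon)}\, k} = \frac{1}{(1 + \eta)^2}\, k.$$

Since the above inequality holds for any $\eta > 0$, we can take $\eta \to 0$ and conclude that
$\liminf_{\epsilon \to 0} S(\epsilon) \ge k$. This finishes the proof. ■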
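The statement can also be explored numerically. The sketch below solves, by bisection, the serial equation $2D^2L/m + 2D\sigma/\sqrt{m} = \epsilon$ and the DMB equation $2b(m)D^2L/m + 2D\sigma/\sqrt{m} = \epsilon$ with $b(m) = (\theta\sigma/(DL))\, m^{\rho}$, and then evaluates $S(\epsilon) = m_{\mathrm{srl}} / \left((m_{\mathrm{DMB}}/b)(b/k + \delta)\right)$. The function names and the values $D = L = \sigma = \theta = 1$, $\rho = 1/3$, $k = 32$, $\delta = 5$ are illustrative assumptions.

```python
import math


def solve_m(gap, eps, lo=1.0, hi=1e30):
    """Bisection in log-space for the m with gap(m) = eps; gap must decrease in m."""
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if gap(mid) > eps:
            lo = mid
        else:
            hi = mid
    return hi


def speed_up_eps(eps, k, delta, D=1.0, L=1.0, sigma=1.0, theta=1.0, rho=1.0 / 3.0):
    """Speed-up with respect to a target optimality gap eps."""
    def batch(m):
        return theta * sigma / (D * L) * m ** rho

    def serial_gap(m):
        return 2 * D**2 * L / m + 2 * D * sigma / math.sqrt(m)

    def dmb_gap(m):
        return 2 * batch(m) * D**2 * L / m + 2 * D * sigma / math.sqrt(m)

    m_srl = solve_m(serial_gap, eps)
    m_dmb = solve_m(dmb_gap, eps)
    b = batch(m_dmb)
    return m_srl / ((m_dmb / b) * (b / k + delta))


for eps in (1e-1, 1e-3, 1e-6):
    print(f"eps = {eps:.0e}: speed-up = {speed_up_eps(eps, k=32, delta=5.0):.2f}")
```

Under these assumptions the computed speed-up increases toward $k$ as $\epsilon$ shrinks, in line with the limit stated in the theorem.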

For accelerated stochastic gradient methods whose convergence rates have a similar
dependence on the gradient variance (Lan, 2009; Hu et al., 2009; Xiao, 2010; Ghadimi and Lan,
2010), the batch size $b$ has an even smaller effect on the convergence rate (see discussions
after Theorem 5), which implies a better parallel speed-up.

6. Experiments

We conducted experiments with a large-scale online binary classification problem. First, 
we obtained a log of one billion queries issued to the Internet search engine Bing. Each 
entry in the log specifies a time stamp, a query text, and the id of the user who issued the 
query (using a temporary browser cookie). A query is said to be highly monetizable if, in 
the past, users who issued this query tended to then click on online advertisements. Given 
a predefined list of one million highly monetizable queries, we observe the queries in the log 
one-by-one and attempt to predict whether the next query will be highly monetizable or 
not. A clever search engine could use this prediction to optimize the way it presents search 
results to the user. A prediction algorithm for this task must keep up with the stream of 
queries received by the search engine, which calls for a distributed solution. 

The predictions are made based on the recent query-history of the current user. For 
example, the predictor may learn that users who recently issued the queries "island weather" 
and "sunscreen reviews" (both not highly monetizable in our data) are likely to issue a 
subsequent query which is highly monetizable (say, a query like "Hawaii vacation" ) . In the 
next section, we formally define how each input, zt, is constructed. 

First, let $n$ denote the number of distinct queries that appear in the log and assume
that we have enumerated these queries, $q_1, \ldots, q_n$. Now define $x_t \in \{0, 1\}^n$ as follows:

$$x_{t,j} = \begin{cases} 1 & \text{if query } q_j \text{ was issued by the current user during the last two hours,} \\ 0 & \text{otherwise.} \end{cases}$$

Let $y_t$ be a binary variable, defined as

$$y_t = \begin{cases} +1 & \text{if the current query is highly monetizable,} \\ -1 & \text{otherwise.} \end{cases}$$

In other words, $y_t$ is the binary label that we are trying to predict. Before observing $x_t$ or
$y_t$, our algorithm chooses a vector $w_t \in \mathbb{R}^n$. Then $x_t$ is observed and the resulting binary
prediction is the sign of their inner product $\langle w_t, x_t \rangle$. Next, the correct label $y_t$ is revealed
and our binary prediction is incorrect if $y_t \langle w_t, x_t \rangle \le 0$. We can re-state this prediction
problem in an equivalent way by defining $z_t = y_t x_t$, and saying that an incorrect prediction
occurs when $\langle w_t, z_t \rangle \le 0$.

We adopt the logistic loss function as a smooth convex proxy to the error indicator
function. Formally, define $f$ as

$$f(w, z) = \log_2\left(1 + \exp(-\langle w, z \rangle)\right).$$

Additionally, we introduced the convex regularization constraint $\|w_t\| \le C$, where $C$ is a
predefined regularization parameter.
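For concreteness, the loss, its gradient and the norm constraint can be written out as below. This is a minimal sketch with dense Python lists, assuming $z = y\,x$ with $y \in \{-1, +1\}$ and a Euclidean norm constraint; the names `logistic_loss`, `logistic_grad` and `project_ball` are hypothetical, and a practical implementation would use sparse vectors.

```python
import math


def logistic_loss(w, z):
    """f(w, z) = log2(1 + exp(-<w, z>)), evaluated without overflow."""
    margin = sum(wi * zi for wi, zi in zip(w, z))
    if margin >= 0:
        nats = math.log1p(math.exp(-margin))
    else:
        nats = -margin + math.log1p(math.exp(margin))
    return nats / math.log(2)


def logistic_grad(w, z):
    """Gradient with respect to w: -z / (ln(2) * (1 + exp(<w, z>)))."""
    margin = sum(wi * zi for wi, zi in zip(w, z))
    if margin >= 0:
        e = math.exp(-margin)
        s = e / (1.0 + e)            # equals 1 / (1 + exp(margin))
    else:
        s = 1.0 / (1.0 + math.exp(margin))
    return [-zi * s / math.log(2) for zi in z]


def project_ball(w, C):
    """Euclidean projection onto the set {w : ||w|| <= C}."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return list(w) if norm <= C else [wi * C / norm for wi in w]


x, y = [1.0, 0.0, 1.0], -1.0         # two recent queries present, negative label
z = [y * xi for xi in x]
print(logistic_loss([0.0, 0.0, 0.0], z))
print(logistic_grad([0.0, 0.0, 0.0], z))
```

With the base-2 logarithm the loss equals 1 at zero margin, so it upper bounds the mistake indicator for $\langle w, z \rangle \le 0$.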

We ran the synchronous version of our distributed algorithm using the Euclidean dual 
averaging update rule dH) in a cluster simulation. The simulation allowed us to easily 
investigate the effects of modifying the number of nodes in the cluster and the latencies in 
the network. 

We wanted to specify a realistic latency in our simulation, which faithfully mimics the
behavior of a real network in a search engine datacenter. To this end, we assumed that the
nodes are connected via a standard 1 Gb/s Ethernet network. Moreover, we assumed that
the nodes are arranged in a precomputed logical binary-tree communication structure, and
that all communication is done along the edges in this tree. We conservatively estimated
the round-trip latency between proximal nodes in the tree to be 0.5 ms. Therefore, the total
time to complete each vector-sum network operation is $\log_2(k)$ ms, where $k$ is the number
of nodes in the cluster. We assumed that our search engine receives 4 queries per ms (which
adds up to ten billion queries a month). Overall, the number of queries discarded between
mini-batches is $\mu = 4\log_2(k)$.
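The latency arithmetic can be reproduced in a few lines. The sketch assumes that a vector-sum traverses the binary tree once upward and once downward, $2\log_2(k)$ hops in total, and that each hop costs the stated 0.5 ms; that reading is an assumption chosen to match the stated total of $\log_2(k)$ ms, and the function name `discarded_inputs` is hypothetical.

```python
import math


def discarded_inputs(k, link_ms=0.5, queries_per_ms=4.0):
    """Inputs that arrive while one vector-sum completes on a k-node binary tree."""
    hops = 2 * math.log2(k)              # up the tree, then back down (assumed)
    latency_ms = hops * link_ms          # log2(k) ms when link_ms is 0.5
    return queries_per_ms * latency_ms


for k in (32, 1024):
    print(f"k = {k}: {discarded_inputs(k):.0f} inputs discarded per batch")
print(f"queries per 30-day month at 4 per ms: {4 * 1000 * 86400 * 30:.2e}")
```

At 4 queries per ms, a 30-day month contains about $1.04 \times 10^{10}$ queries, which agrees with the figure of ten billion queries a month.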

In all of our experiments, we use the algorithmic parameter $\alpha_j = L + \gamma\sqrt{j}$ (see
Theorem 2). We set the smoothness parameter $L$ to a constant, and the parameter $\gamma$ to a
constant divided by $\sqrt{b}$. This is because $L$ depends only on the loss function $f$, which does
not change in DMB, while $\gamma$ is proportional to $\sigma$, the standard deviation of the
gradient-averages. We chose the constants by manually exploring the parameter space on a separate
held-out set of 500 million queries.

We report all of our results in terms of the average loss suffered by the online algorithm.
This is simply defined as $(1/t) \sum_{i=1}^{t} f(w_i, z_i)$. We cannot plot regret, as we do not know
the offline risk minimizer $w^*$.



6.1 Serial Mini-Batching 

As a warm-up, we investigated the effects of modifying the mini-batch size $b$ in a standard
serial Euclidean dual averaging algorithm. This is equivalent to running the distributed
simulation with a cluster size of $k = 1$, with varying mini-batch size. We ran the experiment
with $b = 1, 2, 4, \ldots, 1024$. Figure 2 shows the results for three representative mini-batch
sizes. The experiments tell an interesting story, which is more refined than our theoretical 
upper bounds. While the asymptotic worst-case theory implies that batch-size should have 
no significant effect, we actually observe that mini-batching accelerates the learning process 
on the first $10^8$ inputs. On the other hand, after $10^8$ inputs, a large mini-batch size begins
to hurt us and the smaller mini-batch sizes gain the lead. This behavior is not an artifact of
our choice of the parameters $\gamma$ and $L$, as we observed similar behavior for many different
parameter settings during the initial stage when we tuned the parameters on a held-out set.

Similar transient behaviors also exist for multi-step stochastic gradient methods (see,
e.g., Polyak, 1987, Section 4.3.2), where the multi-step interpolation of the gradients also
gives a smoothing effect similar to that of using averaged gradients. Typically such methods
converge faster in the early iterations when the iterates are far from the optimal solution and
the relative value of the stochastic noise is small, but become less effective asymptotically.

Figure 2: The effects of the batch size on the average loss in serial mini-batching. The
mini-batch algorithm was applied with different batch sizes. The x-axis presents
the number of instances observed, and the y-axis presents the average loss. Note
that the case $b = 1$ is the standard serial dual-averaging algorithm.

Figure 3: Comparing DMB with the serial algorithm and the no-communication distributed
algorithm. Results for a large cluster of $k = 1024$ machines are presented on the
left. Results for a small cluster of $k = 32$ machines are presented on the right.



6.2 Evaluating DMB

Next, we compared the average loss of the DMB algorithm with the average loss of the
serial algorithm and the no-communication algorithm (where each cluster node works
independently). We tried two versions of the no-communication solution. The first version
simply runs $k$ independent copies of the serial prediction algorithm. The second version
runs $k$ independent copies of the serial mini-batch algorithm, with a mini-batch size of 128.
We included the second version of the no-communication algorithm after observing that
mini-batching has significant advantages even in the serial setting. We experimented with
various cluster sizes and various mini-batch sizes. As mentioned above, we set the latency
of the DMB algorithm to $\mu = 4\log_2(k)$. Taking a cue from our theoretical analysis, we
set the batch size to $b = m^{1/3} \approx 1024$. We repeated the experiment for various cluster
sizes and the results were very consistent. Figure 3 presents the average loss of the three
algorithms for clusters of sizes $k = 1024$ and $k = 32$. Clearly, the simple no-communication
algorithm performs very poorly compared to the others. The no-communication algorithm
that uses mini-batch updates on each node does surprisingly well, but is still outperformed
quite significantly by the DMB solution.

Figure 4: The effects of increased network latency. The loss of the DMB algorithm is
reported with different latencies as measured by $\mu$. In all cases, the batch size is
fixed at $b = 1024$.

6.3 The Effects of Latency 

Network latency results in the DMB algorithm discarding gradients, and slows down the
algorithm's progress. The theoretical analysis shows that this waste is negligible in the
asymptotic worst-case sense. However, latency will obviously have some negative effect on any
finite prefix of the input stream. We examined what would happen if the single-link latency were
much larger than our 0.5 ms estimate (e.g., if the network is very congested or if the cluster
nodes are scattered across multiple datacenters). Concretely, we set the cluster size to
$k = 1024$ nodes, the batch size to $b = 1024$, and the single-link latency to $0.5, 1, 2, \ldots, 512$
ms. That is, 0.5 ms mimics a realistic 1 Gb/s Ethernet link, while 512 ms mimics a network
whose latency between any two machines is 1024 times greater, namely, each vector-sum
operation takes a full second to complete. Note that $\mu$ is still computed as before, namely, for
latency $0.5 \cdot 2^p$, $\mu = 2^p \cdot 4\log_2(k) = 2^p \cdot 40$. Figure 4 shows how the average loss curve reacts to
four representative latencies. As expected, the convergence rate degrades monotonically with
latency. When latency is set to be 8 times greater than our realistic estimate for 1 Gb/s
Ethernet, the effect is minor. When the latency is increased by a factor of 1024, the effect
becomes more noticeable, but still quite small.

Figure 5: The effect of different mini-batch sizes ($b$) on the DMB algorithm. The DMB
algorithm was applied with different batch sizes $b = 8, \ldots, 4096$. The loss is reported
after $10^7$ instances (left), $10^8$ instances (middle) and $10^9$ instances (right).



6.4 Optimal Mini-Batch Size 

For our final experiment, we set out to find the optimal batch size for our problem on a
given cluster size. Our theoretical analysis is too crude to provide a sufficient answer to this
question. The theory basically says that setting $b = \Theta(m^{\rho})$ is asymptotically optimal for
any $\rho \in (0, 1/2)$, and that $b = \Theta(m^{1/3})$ is a pretty good concrete choice. We have already
seen that larger batch sizes accelerate the initial learning phase, even in a serial setting. We
set the cluster size to $k = 32$ and set the batch size to $8, 16, \ldots, 4096$. Note that $b = 32$ is the
case where each node processes a single example before engaging in a vector-sum network
operation. Figure 5 depicts the average loss after $10^7$, $10^8$, and $10^9$ inputs. As noted in the
serial case, larger batch sizes ($b = 512$) are beneficial at first ($m = 10^7$), while smaller batch
sizes ($b = 128$) are better in the end ($m = 10^9$).



6.5 Discussion 

We presented an empirical evaluation of the serial mini-batch algorithm and its distributed 
version, the DMB algorithm, on a realistic web-scale online prediction problem. As ex- 
pected, the DMB algorithm outperforms the naive no-communication algorithm. An in- 
teresting and somewhat unexpected observation is the fact that the use of large batches 
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improves performance even in the serial setting. Moreover, the optimal batch size seems to 
generally decrease with time. 

We also demonstrated the effect of network latency on the performance of the DMB 
algorithm. Even for relatively large values of $\mu$, the degradation in performance was modest.
This is an encouraging indicator of the efficiency and robustness of the DMB algorithm,
even when implemented in a high-latency environment, such as a grid.



7. Related Work 



In recent years there has been a growing interest in distributed online learning and
distributed optimization.

Langford et al. (2009) address the distributed online learning problem, with a similar
motivation to ours: trying to address the scalability problem of online learning algorithms
which are inherently sequential. The main observation Langford et al. (2009) make is that
in many cases, computing the gradient takes much longer than computing the update
according to the online prediction algorithm. Therefore, they present a pipeline computational
model. Each worker alternates between computing the gradient and computing the update
rule. The different workers are synchronized such that no two workers perform an update
simultaneously.

Similar to results presented in this paper, Langford et al. (2009) attempted to show that
it is possible to achieve a cumulative regret of $O(\sqrt{m})$ with $k$ parallel workers, compared to
the $O(\sqrt{km})$ of the naive solution. However, their work suffers from a few limitations. First,
their proofs only hold for unconstrained convex optimization where no projection is needed.
Second, since they work in a model where one node at a time updates a shared predictor,
while the other nodes compute gradients, the scalability of their proposed method is limited
by the ratio between the time it takes to compute a gradient and the time it takes to run the
update rule of the serial online learning algorithm.

In another related work, Duchi et al. (2010) present a distributed dual averaging method
for optimization over networks. They assume the loss functions are Lipschitz continuous,
but their gradients may not be. Their method does not need synchronization to average
gradients computed at the same point. Instead, they employ a distributed consensus
algorithm on all the gradients generated by different processors at different points. When
applied to the stochastic online prediction setting, even for the most favorable class of
communication graphs, with constant spectral gaps (e.g., expander graphs), their best regret
bound is $O(\sqrt{km}\, \log(m))$. This bound is no better than one would get by running $k$ parallel
machines without communication (see Section 2.2).

In another recent work, Zinkevich et al. (2010) study a method where each node in the
network runs the classic stochastic gradient method, using random subsets of the overall
data set, and the nodes only aggregate their solutions at the end (by averaging their final weight
vectors). In terms of online regret, this is the same as running $k$ machines independently
without communication. So a more suitable measure is the optimality gap (defined in
our discussion of stochastic optimization) of the final averaged predictor. Even with respect to
this measure, their expected optimality gap does not show an advantage over running $k$ machines
independently. A similar approach was also considered by Nesterov and Vial (2008), and an
experimental study of such a method was reported in Harrington et al. (2003).

A key difference between our DMB framework and much of the related work is that DMB does
not consider distributed computing as a constraint to overcome. Instead, our novel use of
the variance-based regret bounds can exploit parallel/distributed computing to obtain the
asymptotically optimal regret bound. Beyond the asymptotic optimality of our bounds, our
work has other features that set it apart from previous work. As far as we know, we are
the first to propose a general principled framework for distributing many gradient-based
update rules, with a concrete regret analysis for the large family of mirror descent and dual
averaging update rules. Additionally, our work is the first to explicitly include network
latency in the regret analysis, and to theoretically guarantee that a large latency can be
overcome by setting parameters appropriately.

8. Conclusions and Further Research

The increase in serial computing power of modern computers is out-paced by the growth 
rate of web-scale prediction problems and data sets. Therefore, it is necessary to adopt 
techniques that can harness the power of parallel and distributed computers. 

In this work we studied the problems of distributed stochastic online prediction and
distributed stochastic optimization. We presented a family of distributed online algorithms 
with asymptotically optimal regret and optimality gap guarantees. Our algorithms use 
the distributed computing infrastructure to reduce the variance of stochastic gradients, 
which essentially reduces the noise in the algorithm's updates. Our analysis shows that 
asymptotically, a distributed computing system can perform as well as a hypothetical fast 
serial computer. This result is far from trivial, and much of the prior art in the field did 
not show any provable gain by using distributed computers. 

While the focus of this work is the theoretical analysis of a distributed online prediction 
algorithm, we also presented experiments on a large-scale real-world problem. Our exper- 
iments showed that indeed the DMB algorithm outperforms other simple solutions. They 
also suggested that improvements can be made by optimizing the batch size and adjusting 
the learning rate based on empirical measures. 

Our formal analysis hinges on the fact that the regret bounds of many stochastic online 
update rules scale with the variance of the stochastic gradients when the loss function is 
smooth. It is unclear if smoothness is a necessary condition, or if it can be replaced with 
a weaker assumption. In principle, our results apply in a broader setting. For any serial
update rule $\phi$ with a regret bound of $\psi(\sigma^2, m) = C\sigma\sqrt{m} + o(\sqrt{m})$, the DMB algorithm and
its variants have the optimal regret bound of $C\sigma\sqrt{m} + o(\sqrt{m})$, provided that the bound
$\psi(\sigma^2, m)$ applies equally to the function $f$ and to the function

$$\bar f\big(w, (z_1, \ldots, z_b)\big) = \frac{1}{b} \sum_{s=1}^{b} f(w, z_s).$$

Note that this result holds independently of the network size $k$ and the network latency $\mu$.
Extending our results to non-smooth functions is an interesting open problem. A more 
ambitious challenge is to extend our results to the non-stochastic case, where inputs may 
be chosen by an adversary. 
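
The role of the averaged function $\bar f$ can be illustrated numerically. The following is a minimal sketch, not part of the analysis above: it assumes the illustrative loss $f(w, z) = \frac{1}{2}(w - z)^2$ with Gaussian inputs (so each stochastic gradient has variance $\sigma^2$), and the helper names `noisy_gradient`, `minibatch_gradient`, and `empirical_variance` are hypothetical. It estimates the variance of a gradient averaged over $b$ inputs taken at the same point, which should shrink roughly like $\sigma^2 / b$.

```python
import random
import statistics


def noisy_gradient(w, rng, sigma=1.0):
    """Gradient of the illustrative loss f(w, z) = 0.5 * (w - z)**2, z ~ N(0, sigma^2).

    The gradient is w - z, so its variance is sigma**2 at every point w.
    """
    z = rng.gauss(0.0, sigma)
    return w - z


def minibatch_gradient(w, b, rng, sigma=1.0):
    """Average of b independent stochastic gradients, all taken at the same point w."""
    return sum(noisy_gradient(w, rng, sigma) for _ in range(b)) / b


def empirical_variance(b, trials=5000, seed=0):
    """Empirical variance of the mini-batch gradient for batch size b."""
    rng = random.Random(seed)
    samples = [minibatch_gradient(0.3, b, rng) for _ in range(trials)]
    return statistics.pvariance(samples)


if __name__ == "__main__":
    for b in (1, 4, 16, 64):
        print("batch size", b, "-> gradient variance approx.", empirical_variance(b))
```

Under these assumptions, the printed variances should fall by roughly a factor of four each time the batch size is quadrupled, which is the effect the bound $\psi(\sigma^2/b, \cdot)$ exploits.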

An important future direction is to develop distributed learning algorithms that perform
robustly and efficiently on heterogeneous clusters and in asynchronous distributed
environments. This direction has been further explored in Dekel et al. (2011). For example, one
can use the following simple reformulation of the DMB algorithm in a master-workers
setting: each worker processes inputs at its own pace and periodically sends the accumulated
gradients to the master; the master applies the update rule whenever the number of
accumulated gradients reaches a certain threshold and broadcasts the new predictor back to the
workers. In a dynamic environment, where the network can be partitioned and reconnected
and where nodes can be added and removed, a new master (or masters) can be chosen as
needed by a standard leader election algorithm. We refer the reader to Dekel et al. (2011)
for more details.
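
The master-workers reformulation described above can be sketched as a single-process simulation. This is only an illustration under stated assumptions: the class `Master`, the function `simulate`, the loss $f(w, z) = \frac{1}{2}(w - z)^2$, the plain gradient step standing in for the update rule, and the instantaneous broadcast are all hypothetical choices, not the construction of Dekel et al. (2011).

```python
import random


class Master:
    """Hypothetical master node: updates once `threshold` gradients have accumulated."""

    def __init__(self, w0, threshold, step_size):
        self.w = w0
        self.threshold = threshold
        self.step_size = step_size
        self.grad_sum = 0.0
        self.count = 0
        self.version = 0  # number of broadcasts so far

    def receive(self, grad_sum, count):
        """Accumulate a worker's partial sum; update and broadcast when the batch is full."""
        self.grad_sum += grad_sum
        self.count += count
        if self.count < self.threshold:
            return False
        average = self.grad_sum / self.count
        self.w -= self.step_size * average  # plain gradient step stands in for the update rule
        self.grad_sum, self.count = 0.0, 0
        self.version += 1
        return True


def simulate(num_workers=4, rounds=2000, threshold=32, step_size=0.1, seed=0):
    """Workers with different paces send accumulated gradients to the master."""
    rng = random.Random(seed)
    master = Master(w0=5.0, threshold=threshold, step_size=step_size)
    for _ in range(rounds):
        worker = rng.randrange(num_workers)
        n_inputs = 1 + worker      # worker k processes k + 1 inputs per turn
        w_local = master.w         # predictor from the latest broadcast
        grad_sum = 0.0
        for _ in range(n_inputs):
            z = rng.gauss(1.0, 1.0)
            grad_sum += w_local - z  # gradient of 0.5 * (w - z)**2 at w_local
        master.receive(grad_sum, n_inputs)
    return master


if __name__ == "__main__":
    m = simulate()
    print("final predictor:", m.w, "after", m.version, "updates")
```

In this toy setting the inputs have mean 1, so the predictor should settle near 1 even though the workers contribute gradients at different rates.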

A central property of our method is that all of the gradients in a batch must be taken
at the same prediction point. In an asynchronous distributed computing environment (see,
e.g., Tsitsiklis et al., 1986; Bertsekas and Tsitsiklis, 1989), this can be quite wasteful. In
order to reduce the waste generated by the need for global synchronization, we may need
to allow different nodes to accumulate gradients at different yet close points. Such a
modification is likely to work since the smoothness assumption precisely states that gradients of
nearby points are similar. There have been extensive studies on distributed optimization
with inaccurate or delayed subgradient information, but mostly without the smoothness
assumption (e.g., Nedić et al., 2001; Nedić and Ozdaglar, 2009). We believe that our main
results under the smoothness assumption can be extended to asynchronous and distributed
environments as well.

Appendix A. Smooth Stochastic Online Prediction in the Serial Setting

In this appendix, we prove expected regret bounds for stochastic dual averaging and
stochastic mirror descent applied to smooth loss functions. In the main body of the paper, we
discussed only the Euclidean special case of these algorithms, while here we present the
algorithms and regret bounds in their full generality. In particular, Theorem 1 is a special
case of Theorem 9 and Theorem 2 is a special case of Theorem 7.

Recall that we observe a stochastic sequence of inputs $z_1, z_2, \ldots$, where each $z_i \in \mathcal{Z}$.
Before observing each $z_i$ we predict $w_i \in W$, and suffer a loss $f(w_i, z_i)$. We assume $W$ is
a closed convex subset of a finite dimensional vector space $V$ with endowed norm $\|\cdot\|$. We
assume that $f(w, z)$ is convex and differentiable in $w$, and we use $\nabla_w f(w, z)$ to denote the
gradient of $f$ with respect to its first argument. $\nabla_w f(w, z)$ is a vector in the dual space $V^*$,
with endowed norm $\|\cdot\|_*$.

We assume that $f(\cdot, z)$ is $L$-smooth for any realization of $z$. Namely, we assume that
$f(\cdot, z)$ is differentiable and that

$$\forall z \in \mathcal{Z},\ \forall w, w' \in W, \quad \|\nabla_w f(w, z) - \nabla_w f(w', z)\|_* \le L \|w - w'\|.$$

We define $F(w) = \mathbb{E}_z[f(w, z)]$ and note that $\nabla_w F(w) = \mathbb{E}_z[\nabla_w f(w, z)]$ (see Rockafellar and Wets,
1982). This implies that

$$\forall w, w' \in W, \quad \|\nabla_w F(w) - \nabla_w F(w')\|_* \le L \|w - w'\|.$$

In addition, we assume that there exists a constant $\sigma \ge 0$ such that

$$\forall w \in W, \quad \mathbb{E}_z\big[\|\nabla_w f(w, z) - \nabla_w \mathbb{E}_z[f(w, z)]\|_*^2\big] \le \sigma^2.$$

We assume that $w^* = \arg\min_{w \in W} F(w)$ exists, and we abbreviate $F^* = F(w^*)$.

Under the above assumptions, we are concerned with bounding the expected regret
$\mathbb{E}[R(m)]$, where regret is defined as

$$R(m) = \sum_{i=1}^{m} \big(f(w_i, z_i) - f(w^*, z_i)\big).$$

In order to present the algorithms in their full generality, we first recall the concepts of 
strongly convex function and Bregman divergence. 

A function $h : W \to \mathbb{R} \cup \{+\infty\}$ is said to be $\mu$-strongly convex with respect to $\|\cdot\|$ if

$$\forall \alpha \in [0, 1],\ \forall u, v \in W, \quad h(\alpha u + (1 - \alpha) v) \le \alpha h(u) + (1 - \alpha) h(v) - \frac{\mu}{2} \alpha (1 - \alpha) \|u - v\|^2.$$

If $h$ is $\mu$-strongly convex, then for any $u \in \mathrm{dom}\, h$ and any $v \in \mathrm{dom}\, h$ at which $h$ is sub-differentiable,

$$\forall s \in \partial h(v), \quad h(u) \ge h(v) + \langle s, u - v \rangle + \frac{\mu}{2} \|u - v\|^2.$$

(See, e.g., Goebel and Rockafellar, 2008.) If a function $h$ is strictly convex and differentiable
(on an open set contained in $\mathrm{dom}\, h$), then we can define the Bregman divergence generated
by $h$ as

$$d_h(u, v) = h(u) - h(v) - \langle \nabla h(v), u - v \rangle.$$

We often drop the subscript $h$ in $d_h$ when it is obvious from the context. Some key properties
of the Bregman divergence are:

• $d(u, v) \ge 0$, and the equality holds if and only if $u = v$.

• In general $d(u, v) \ne d(v, u)$, and $d$ may not satisfy the triangle inequality.

• The following three-point identity follows directly from the definition:

$$d(u, w) = d(u, v) + d(v, w) + \langle \nabla h(v) - \nabla h(w), u - v \rangle.$$

The following inequality is a direct consequence of the $\mu$-strong convexity of $h$:

$$d(u, v) \ge \frac{\mu}{2} \|u - v\|^2. \tag{12}$$
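
The definition and properties above can be checked numerically. The sketch below is an illustration only; the helper names (`bregman`, `three_point_gap`, and the two example generators) are hypothetical. It evaluates the Bregman divergence for the squared Euclidean norm and for the negative entropy, and measures how far the two sides of the three-point identity are from each other.

```python
import math


def bregman(h, grad_h, u, v):
    """d_h(u, v) = h(u) - h(v) - <grad h(v), u - v>, with vectors given as lists."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_h(v), u, v))
    return h(u) - h(v) - inner


def h_euclid(w):
    """h(w) = 0.5 * ||w||_2^2, which is 1-strongly convex w.r.t. the Euclidean norm."""
    return 0.5 * sum(x * x for x in w)


def grad_euclid(w):
    return list(w)


def h_entropy(w):
    """Negative entropy h(w) = sum_i w_i log w_i, defined for strictly positive w."""
    return sum(x * math.log(x) for x in w)


def grad_entropy(w):
    return [math.log(x) + 1.0 for x in w]


def three_point_gap(h, grad_h, u, v, w):
    """Left side minus right side of the three-point identity (should be ~0)."""
    extra = sum(
        (gv - gw) * (a - b)
        for gv, gw, a, b in zip(grad_h(v), grad_h(w), u, v)
    )
    lhs = bregman(h, grad_h, u, w)
    rhs = bregman(h, grad_h, u, v) + bregman(h, grad_h, v, w) + extra
    return lhs - rhs


if __name__ == "__main__":
    u, v, w = [0.2, 0.3, 0.5], [0.5, 0.25, 0.25], [0.1, 0.6, 0.3]
    print("Euclidean d(u, v):", bregman(h_euclid, grad_euclid, u, v))
    print("entropy d(u, v):", bregman(h_entropy, grad_entropy, u, v))
    print("entropy d(v, u):", bregman(h_entropy, grad_entropy, v, u))
    print("three-point gap:", three_point_gap(h_entropy, grad_entropy, u, v, w))
```

For points on the probability simplex, the negative-entropy divergence coincides with the KL-divergence, and the two entropy values printed above differ, illustrating the asymmetry noted in the second property.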

A.1 Stochastic Dual Averaging

The proof techniques for the stochastic dual averaging method are adapted from those for
the accelerated algorithms presented in Tseng (2008) and Xiao (2010).

Let $h : W \to \mathbb{R}$ be a 1-strongly convex function. Without loss of generality, we can
assume that $\min_{w \in W} h(w) = 0$. In the stochastic dual averaging method, we predict each
$w_i$ by

$$w_{i+1} = \arg\min_{w \in W} \Big\{ \sum_{j=1}^{i} \langle g_j, w \rangle + (L + \beta_{i+1}) h(w) \Big\}, \tag{13}$$

where $g_j$ denotes the stochastic gradient $\nabla_w f(w_j, z_j)$, and $(\beta_i)_{i \ge 1}$ is a sequence of positive
and nondecreasing parameters (i.e., $\beta_{i+1} \ge \beta_i$). As a special case of the above, we initialize
$w_1$ to

$$w_1 = \arg\min_{w \in W} h(w). \tag{14}$$

We are now ready to state a bound on the expected regret of the dual averaging method, 
in the smooth stochastic case. 



Theorem 7 The expected regret of the stochastic dual averaging method is bounded as

$$\forall m, \quad \mathbb{E}[R(m)] \le \big(F(w_1) - F(w^*)\big) + (L + \beta_m) h(w^*) + \frac{\sigma^2}{2} \sum_{i=1}^{m-1} \frac{1}{\beta_i}.$$



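The optimal choice of $\beta_i$ is exactly of order $\sqrt{i}$. More specifically, let $\beta_i = \gamma \sqrt{i}$, where $\gamma$
is a positive parameter. Then Theorem 7 implies that

$$\mathbb{E}[R(m)] \le \big(F(w_1) - F(w^*)\big) + L h(w^*) + \Big(\gamma h(w^*) + \frac{\sigma^2}{\gamma}\Big) \sqrt{m}.$$

Choosing $\gamma = \sigma / \sqrt{h(w^*)}$ gives

$$\mathbb{E}[R(m)] \le \big(F(w_1) - F(w^*)\big) + L h(w^*) + \big(2 \sigma \sqrt{h(w^*)}\big) \sqrt{m}.$$

If $\nabla F(w^*) = 0$ (this is certainly the case if $W$ is the whole space), then we have

$$F(w_1) - F(w^*) \le \frac{L}{2} \|w_1 - w^*\|^2 \le L h(w^*).$$

Then the expected regret bound can be simplified as

$$\mathbb{E}[R(m)] \le 2 L h(w^*) + \big(2 \sigma \sqrt{h(w^*)}\big) \sqrt{m}.$$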
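
To make the update in Equation (13) concrete, the following is a minimal sketch of its Euclidean special case with $\beta_i = \gamma\sqrt{i}$. Several choices here are assumptions made only for illustration: $h(w) = \frac{1}{2}\|w\|_2^2$, $W$ a Euclidean ball of a given radius, the loss $f(w, z) = \frac{1}{2}\|w - z\|_2^2$ (which is 1-smooth), and the helper names `project_ball`, `dual_averaging`, and `regret`. With this $h$, the minimizer in Equation (13) is the projection of $-\sum_j g_j / (L + \beta_{i+1})$ onto the ball.

```python
import math
import random


def project_ball(w, radius):
    """Euclidean projection of w onto the ball of the given radius centered at 0."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return list(w)
    return [x * radius / norm for x in w]


def dual_averaging(inputs, L, gamma, radius):
    """Euclidean dual averaging with h(w) = 0.5 * ||w||^2 and beta_i = gamma * sqrt(i).

    Uses the illustrative loss f(w, z) = 0.5 * ||w - z||^2, whose gradient is w - z.
    Returns the predictions w_1, ..., w_m.
    """
    dim = len(inputs[0])
    w = [0.0] * dim          # w_1 = argmin h(w) = 0, as in Equation (14)
    grad_sum = [0.0] * dim
    predictions = []
    for i, z in enumerate(inputs, start=1):
        predictions.append(w)
        g = [wk - zk for wk, zk in zip(w, z)]            # g_i = grad f(w_i, z_i)
        grad_sum = [s + gk for s, gk in zip(grad_sum, g)]
        c = L + gamma * math.sqrt(i + 1)                  # L + beta_{i+1}
        # argmin over the ball of <grad_sum, w> + c * h(w) is a projection
        w = project_ball([-s / c for s in grad_sum], radius)
    return predictions


def regret(predictions, inputs, w_star):
    """R(m) = sum_i f(w_i, z_i) - f(w*, z_i) for the illustrative quadratic loss."""
    def loss(w, z):
        return 0.5 * sum((a - b) ** 2 for a, b in zip(w, z))
    return sum(loss(w, z) - loss(w_star, z) for w, z in zip(predictions, inputs))


if __name__ == "__main__":
    rng = random.Random(0)
    w_star = [1.0, -0.5]
    zs = [[w_star[0] + rng.gauss(0, 1), w_star[1] + rng.gauss(0, 1)] for _ in range(4000)]
    preds = dual_averaging(zs, L=1.0, gamma=1.0, radius=5.0)
    print("average regret:", regret(preds, zs, w_star) / len(zs))
```

In this toy problem the average regret $R(m)/m$ should decrease as $m$ grows, in line with the $O(\sqrt{m})$ bound on $\mathbb{E}[R(m)]$.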



To prove Theorem 7 we require the following fundamental lemma, which can be found,
for example, in Nesterov (2005), Tseng (2008) and Xiao (2010).

Lemma 8 Let $W$ be a closed convex set, $\varphi$ be a convex function on $W$, and $h$ be $\mu$-strongly
convex on $W$ with respect to $\|\cdot\|$. If

$$w^+ = \arg\min_{w \in W} \big\{ \varphi(w) + h(w) \big\},$$

then

$$\forall w \in W, \quad \varphi(w) + h(w) \ge \varphi(w^+) + h(w^+) + \frac{\mu}{2} \|w - w^+\|^2.$$



With Lemma 8, we are now ready to prove Theorem 7.
Proof First, we define the linear functions

$$\ell_i(w) = F(w_i) + \langle \nabla F(w_i), w - w_i \rangle, \quad \forall i \ge 1,$$

and (using the notation $g_i = \nabla f(w_i, z_i)$)

$$\hat\ell_i(w) = F(w_i) + \langle g_i, w - w_i \rangle = \ell_i(w) + \langle q_i, w - w_i \rangle,$$

where

$$q_i = g_i - \nabla F(w_i).$$

Therefore, the stochastic dual averaging method specified in Equation (13) is equivalent to

$$w_i = \arg\min_{w \in W} \Big\{ \sum_{j=1}^{i-1} \hat\ell_j(w) + (L + \beta_i) h(w) \Big\}.$$



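Using the smoothness assumption, we have (e.g., Nesterov, 2004, Lemma 1.2.3)

$$
\begin{aligned}
F(w_{i+1}) &\le \ell_i(w_{i+1}) + \frac{L}{2} \|w_{i+1} - w_i\|^2 \\
&= \hat\ell_i(w_{i+1}) + \frac{L + \beta_i}{2} \|w_{i+1} - w_i\|^2 - \langle q_i, w_{i+1} - w_i \rangle - \frac{\beta_i}{2} \|w_{i+1} - w_i\|^2 \\
&\le \hat\ell_i(w_{i+1}) + \frac{L + \beta_i}{2} \|w_{i+1} - w_i\|^2 + \|q_i\|_* \|w_{i+1} - w_i\| - \frac{\beta_i}{2} \|w_{i+1} - w_i\|^2 \\
&= \hat\ell_i(w_{i+1}) + \frac{L + \beta_i}{2} \|w_{i+1} - w_i\|^2 - \Big( \frac{1}{\sqrt{2\beta_i}} \|q_i\|_* - \sqrt{\frac{\beta_i}{2}} \|w_{i+1} - w_i\| \Big)^2 + \frac{\|q_i\|_*^2}{2\beta_i} \\
&\le \hat\ell_i(w_{i+1}) + \frac{L + \beta_i}{2} \|w_{i+1} - w_i\|^2 + \frac{\|q_i\|_*^2}{2\beta_i}.
\end{aligned}
\tag{15}
$$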
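Next we use Lemma 8 with $\varphi(w) = \sum_{j=1}^{i-1} \hat\ell_j(w)$ and $\mu = (L + \beta_i)$,

$$\sum_{j=1}^{i-1} \hat\ell_j(w_{i+1}) + (L + \beta_i) h(w_{i+1}) \ge \sum_{j=1}^{i-1} \hat\ell_j(w_i) + (L + \beta_i) h(w_i) + \frac{L + \beta_i}{2} \|w_{i+1} - w_i\|^2.$$

Combining the above inequality with Equation (15), we have

$$
\begin{aligned}
F(w_{i+1}) &\le \hat\ell_i(w_{i+1}) + \sum_{j=1}^{i-1} \hat\ell_j(w_{i+1}) + (L + \beta_i) h(w_{i+1}) - \sum_{j=1}^{i-1} \hat\ell_j(w_i) - (L + \beta_i) h(w_i) + \frac{\|q_i\|_*^2}{2\beta_i} \\
&\le \sum_{j=1}^{i} \hat\ell_j(w_{i+1}) + (L + \beta_{i+1}) h(w_{i+1}) - \sum_{j=1}^{i-1} \hat\ell_j(w_i) - (L + \beta_i) h(w_i) + \frac{\|q_i\|_*^2}{2\beta_i},
\end{aligned}
$$

where in the last inequality, we used the assumptions $\beta_{i+1} \ge \beta_i > 0$ and $h(w_{i+1}) \ge 0$.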
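Summing the above inequality from $i = 1$ to $i = m - 1$ (the right-hand side telescopes, and the
term $(L + \beta_1) h(w_1)$ vanishes because $h(w_1) = 0$), we have

$$
\begin{aligned}
\sum_{i=2}^{m} F(w_i) &\le \sum_{i=1}^{m-1} \hat\ell_i(w_m) + (L + \beta_m) h(w_m) + \sum_{i=1}^{m-1} \frac{\|q_i\|_*^2}{2\beta_i} \\
&\le \sum_{i=1}^{m-1} \hat\ell_i(w^*) + (L + \beta_m) h(w^*) + \sum_{i=1}^{m-1} \frac{\|q_i\|_*^2}{2\beta_i} \\
&\le \sum_{i=1}^{m-1} \ell_i(w^*) + (L + \beta_m) h(w^*) + \sum_{i=1}^{m-1} \frac{\|q_i\|_*^2}{2\beta_i} + \sum_{i=1}^{m-1} \langle q_i, w^* - w_i \rangle \\
&\le (m - 1) F(w^*) + (L + \beta_m) h(w^*) + \sum_{i=1}^{m-1} \frac{\|q_i\|_*^2}{2\beta_i} + \sum_{i=1}^{m-1} \langle q_i, w^* - w_i \rangle.
\end{aligned}
$$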



Therefore,

$$\sum_{i=2}^{m} \big(F(w_i) - F(w^*)\big) \le (L + \beta_m) h(w^*) + \sum_{i=1}^{m-1} \frac{\|q_i\|_*^2}{2\beta_i} + \sum_{i=1}^{m-1} \langle q_i, w^* - w_i \rangle. \tag{16}$$

Notice that each $w_i$ is a deterministic function of $z_1, \ldots, z_{i-1}$, so

$$\mathbb{E}_{z_i}\big(\langle q_i, w^* - w_i \rangle \,\big|\, z_1, \ldots, z_{i-1}\big) = 0$$

by recalling the definition $q_i = \nabla f(w_i, z_i) - \nabla F(w_i)$. Taking expectation of both sides of
Equation (16) with respect to $z_1, \ldots, z_m$, and adding the term $F(w_1) - F(w^*)$, we have

$$\mathbb{E} \sum_{i=1}^{m} \big(F(w_i) - F(w^*)\big) \le F(w_1) - F(w^*) + (L + \beta_m) h(w^*) + \mathbb{E} \sum_{i=1}^{m-1} \frac{\|q_i\|_*^2}{2\beta_i}.$$

Theorem 7 is proved by further noticing

$$\mathbb{E} f(w_i, z_i) = \mathbb{E} F(w_i), \quad \mathbb{E} f(w^*, z_i) = F(w^*), \quad \forall i \ge 1,$$

which are due to the fact that $w_i$ is a deterministic function of $z_1, \ldots, z_{i-1}$. ■



A.2 Stochastic Mirror Descent

Variance-based convergence rates for the stochastic Mirror Descent methods are due to
Juditsky et al. (2011), and were extended to an accelerated stochastic Mirror Descent
method by Lan (2009). For completeness, we adapt their proofs to the context of regret for
online prediction problems.

Again let $h : W \to \mathbb{R}$ be a differentiable 1-strongly convex function with $\min_{w \in W} h(w) = 0$.
Also let $d$ be the Bregman divergence generated by $h$. In the stochastic mirror descent
method, we use the same initialization as in the dual averaging method (see Equation (14))
and then we set

$$w_{i+1} = \arg\min_{w \in W} \big\{ \langle g_i, w \rangle + (L + \beta_i) d(w, w_i) \big\}, \quad i \ge 1.$$

As in the dual averaging method, we assume the sequence $(\beta_i)_{i \ge 1}$ to be positive and
nondecreasing.
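
In the Euclidean special case, the mirror descent update above reduces to a projected gradient step. The sketch below assumes, purely for illustration, $h(w) = \frac{1}{2}\|w\|_2^2$ (so that $d(w, w_i) = \frac{1}{2}\|w - w_i\|_2^2$), $W$ a Euclidean ball, the loss $f(w, z) = \frac{1}{2}\|w - z\|_2^2$, and a schedule $\beta_i = \gamma\sqrt{i}$; the names `project_ball` and `mirror_descent` are hypothetical. Under these assumptions the update is $w_{i+1} = \Pi_W\big(w_i - g_i / (L + \beta_i)\big)$, where $\Pi_W$ denotes Euclidean projection onto $W$.

```python
import math
import random


def project_ball(w, radius):
    """Euclidean projection of w onto the ball of the given radius centered at 0."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return list(w)
    return [x * radius / norm for x in w]


def mirror_descent(inputs, L, gamma, radius):
    """Euclidean stochastic mirror descent with beta_i = gamma * sqrt(i).

    Uses the illustrative loss f(w, z) = 0.5 * ||w - z||^2, whose gradient is w - z.
    Returns the predictions w_1, ..., w_m.
    """
    dim = len(inputs[0])
    w = [0.0] * dim  # same initialization as dual averaging: w_1 = argmin h(w) = 0
    predictions = []
    for i, z in enumerate(inputs, start=1):
        predictions.append(w)
        g = [wk - zk for wk, zk in zip(w, z)]   # g_i = grad f(w_i, z_i)
        c = L + gamma * math.sqrt(i)             # L + beta_i
        # argmin over the ball of <g, w> + (c / 2) * ||w - w_i||^2 is a projected step
        w = project_ball([wk - gk / c for wk, gk in zip(w, g)], radius)
    return predictions


if __name__ == "__main__":
    rng = random.Random(0)
    w_star = [1.0, -0.5]
    zs = [[w_star[0] + rng.gauss(0, 1), w_star[1] + rng.gauss(0, 1)] for _ in range(4000)]
    preds = mirror_descent(zs, L=1.0, gamma=0.5, radius=2.0)
    print("last prediction:", preds[-1])
```

Unlike dual averaging, which keeps a running sum of all past gradients, this update only needs the current predictor and the latest gradient.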

Theorem 9 Assume that the convex set $W$ is closed and bounded. In addition assume
$d(u, v)$ is bounded on $W$ and let

$$D^2 = \max_{u, v \in W} d(u, v).$$

Then the expected regret of the stochastic mirror descent method is bounded as

$$\mathbb{E}[R(m)] \le \big(F(w_1) - F(w^*)\big) + (L + \beta_m) D^2 + \frac{\sigma^2}{2} \sum_{i=1}^{m-1} \frac{1}{\beta_i}.$$

Similar to the dual averaging case, using the sequence of parameters $\beta_i = (\sigma / D) \sqrt{i}$ gives
the expected regret bound

$$\mathbb{E}[R(m)] \le \big(F(w_1) - F(w^*)\big) + L D^2 + (2 \sigma D) \sqrt{m}.$$

Again, if $\nabla F(w^*) = 0$, we have $F(w_1) - F(w^*) \le (L/2) \|w_1 - w^*\|^2 \le L h(w^*) \le L D^2$, thus
the simplified bound

$$\mathbb{E}[R(m)] \le 2 L D^2 + (2 \sigma D) \sqrt{m}.$$

We note that here we have stronger assumptions than in the dual averaging case. These
assumptions are certainly satisfied by using the standard Euclidean distance $d(u, v) =
(1/2) \|u - v\|_2^2$ on a compact convex set $W$. However, they exclude the case of using the
KL-divergence $d(u, v) = \sum_i u_i \log(u_i / v_i)$ on the simplex, because the KL-divergence is
unbounded on the simplex. Nevertheless, it is possible to remove such restrictions by
considering other variants of the stochastic mirror descent method. For example, if we
use a constant $\beta_i$ that depends on the prior knowledge of the number of total steps to be
performed, then we can weaken the assumption and replace $D$ in the above bounds by
$\sqrt{h(w^*)}$. More precisely, we have

Theorem 10 Suppose we know the total number of steps $m$ to be performed by the
stochastic mirror descent method ahead of time. Then by using the initialization in Equation (14)
and the constant parameter

$$\beta_i = \frac{\sigma}{\sqrt{2 h(w^*)}} \sqrt{m},$$

we have the expected regret bound

$$\mathbb{E}[R(m)] \le \big(F(w_1) - F(w^*)\big) + L h(w^*) + \sigma \sqrt{2 h(w^*)} \sqrt{m}.$$



Theorem 10 is essentially the same as a result in Lan (2009), who also developed an
accelerated version of the stochastic mirror descent method. To prove Theorem 9 and Theorem 10
we need the following standard lemma, which can be found in Chen and Teboulle (1993),
Lan et al. (2011) and Tseng (2008).

Lemma 11 Let $W$ be a closed convex set, $\varphi$ be a convex function on $W$, and $h$ be a
differentiable, strongly convex function on $W$. Let $d$ be the Bregman divergence generated
by $h$. Given $u \in W$, if

$$w^+ = \arg\min_{w \in W} \big\{ \varphi(w) + d(w, u) \big\},$$

then

$$\varphi(w) + d(w, u) \ge \varphi(w^+) + d(w^+, u) + d(w, w^+).$$

We are ready to prove Theorem 9 and Theorem 10.
Proof We start with the inequality in Equation (15). Using Equation (12) with $\mu = 1$
gives

$$F(w_{i+1}) \le \hat\ell_i(w_{i+1}) + (L + \beta_i) d(w_{i+1}, w_i) + \frac{\|q_i\|_*^2}{2\beta_i}. \tag{17}$$

Now using Lemma 11 with $\varphi(w) = \hat\ell_i(w)$ yields

$$\hat\ell_i(w_{i+1}) + (L + \beta_i) d(w_{i+1}, w_i) \le \hat\ell_i(w^*) + (L + \beta_i) d(w^*, w_i) - (L + \beta_i) d(w^*, w_{i+1}).$$

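Combining with Equation (17) gives

$$
\begin{aligned}
F(w_{i+1}) &\le \hat\ell_i(w^*) + (L + \beta_i) d(w^*, w_i) - (L + \beta_i) d(w^*, w_{i+1}) + \frac{\|q_i\|_*^2}{2\beta_i} \\
&= \hat\ell_i(w^*) + (L + \beta_i) d(w^*, w_i) - (L + \beta_{i+1}) d(w^*, w_{i+1}) + (\beta_{i+1} - \beta_i) d(w^*, w_{i+1}) + \frac{\|q_i\|_*^2}{2\beta_i} \\
&\le F(w^*) + (L + \beta_i) d(w^*, w_i) - (L + \beta_{i+1}) d(w^*, w_{i+1}) + (\beta_{i+1} - \beta_i) D^2 + \frac{\|q_i\|_*^2}{2\beta_i} + \langle q_i, w^* - w_i \rangle,
\end{aligned}
$$

where in the last inequality, we used the definition of $D^2$ and the assumption that $\beta_{i+1} \ge \beta_i$.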
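Summing the above inequality from $i = 1$ to $i = m - 1$, we have

$$
\sum_{i=2}^{m} F(w_i) \le (m - 1) F(w^*) + (L + \beta_1) d(w^*, w_1) - (L + \beta_m) d(w^*, w_m) + (\beta_m - \beta_1) D^2
+ \sum_{i=1}^{m-1} \frac{\|q_i\|_*^2}{2\beta_i} + \sum_{i=1}^{m-1} \langle q_i, w^* - w_i \rangle.
$$

Notice that $d(w^*, w_i) \ge 0$ and $d(w^*, w_1) \le D^2$, so we have

$$\sum_{i=2}^{m} F(w_i) \le (m - 1) F(w^*) + (L + \beta_m) D^2 + \sum_{i=1}^{m-1} \frac{\|q_i\|_*^2}{2\beta_i} + \sum_{i=1}^{m-1} \langle q_i, w^* - w_i \rangle.$$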



The rest of the proof for Theorem 9 is similar to that for the dual averaging method (see
arguments following Equation (16)).

Finally we prove Theorem 10. From the proof of Theorem 9 above, we see that if
$\beta_i = \beta_m$ is a constant for all $i = 1, \ldots, m$, then we have

$$\mathbb{E} \sum_{i=2}^{m} \big(F(w_i) - F(w^*)\big) \le (L + \beta_m) d(w^*, w_1) + \frac{\sigma^2}{2} \sum_{i=1}^{m-1} \frac{1}{\beta_i}.$$

Notice that for the above result, we do not need to assume boundedness of $W$, nor
boundedness of the Bregman divergence $d(u, v)$. Since we use $w_1 = \arg\min_{w \in W} h(w)$ and
assume $h(w_1) = 0$ (without loss of generality), it follows that $d(w^*, w_1) \le h(w^*)$. Plugging in
$\beta_m = \big(\sigma / \sqrt{2 h(w^*)}\big) \sqrt{m}$ gives the desired result. ■



Appendix B. High-Probability Bounds 

For simplicity, the theorems stated throughout the paper involved bounds on the expected
regret, $\mathbb{E}[R(m)]$. A stronger type of result is a high-probability bound, where $R(m)$ itself
is bounded with arbitrarily high probability $1 - \delta$, and the bound has only logarithmic
dependence on $\delta$. Here, we demonstrate how our theorems can be extended to such
high-probability bounds.

First, we need to justify that the expected regret bounds for the online prediction rules
discussed in Appendix A have high-probability versions. For simplicity, we will focus on a
high-probability version of the regret bound for dual averaging (Theorem 7), but exactly the
same technique will work for stochastic mirror descent (Theorem 9 and Theorem 10). With
these results in hand, we will show how our main theorem for distributed learning using
the DMB algorithm (Theorem 4) can be extended to a high-probability version. Identical
techniques will work for the other theorems presented in the paper.

Before we begin, we will need to make a few additional mild assumptions. First, we
assume that there are positive constants $B, G$ such that $|f(w, z)| \le B$ and $\|\nabla_w f(w, z)\|_* \le G$
for all $w \in W$ and $z \in \mathcal{Z}$. Second, we assume that there is a positive constant $\hat\sigma$ such that
$\mathrm{Var}_z\big(f(w, z) - f(w^*, z)\big) \le \hat\sigma^2$ for all $w \in W$ (note that $\hat\sigma^2 \le 4B^2$ always holds). Third,
we assume that $W$ has a bounded diameter $D$, namely $\|w - w'\| \le D$ for all $w, w' \in W$.

Under these assumptions, we can show the following high-probability version of
Theorem 7.

Theorem 12 For any $m$ and any $\delta \in (0, 1]$, the regret of the stochastic dual averaging
method is bounded with probability at least $1 - \delta$ over the sampling of $z_1, \ldots, z_m$ by

m—l 



R{m) < {Fiw,) - F{w*)) + (L + (3m)h{w*) + y E ^ 

1=1 



+ 21og(2/<5) DG + 



2G' 



m 



+ 41og(2A)5jl+ 



log(2/5)- 

Proof The proof of the theorem is identical to the one of Theorem 7 up to Equation (16):

$$\sum_{i=2}^{m} \big(F(w_i) - F(w^*)\big) \le (L + \beta_m) h(w^*) + \sum_{i=1}^{m-1} \frac{\|q_i\|_*^2}{2\beta_i} + \sum_{i=1}^{m-1} \langle q_i, w^* - w_i \rangle. \tag{18}$$

In the proof of Theorem 7 we proceeded by taking expectations of both sides with respect
to the sequence $z_1, \ldots, z_m$. Here, we will do things a bit differently.

The main technical tool we use is a well-known Bernstein-type inequality for martingales
(e.g., Cesa-Bianchi and Lugosi, 2006, Lemma A.8), an immediate corollary of which can be
stated as follows: suppose $x_1, \ldots, x_m$ is a martingale difference sequence with respect to
the sequence $z_1, \ldots, z_m$, such that $|x_i| \le b$, and let



^Var(xi|zi, . . . 



i=l 



Then for any $\delta \in (0, 1)$, it holds with probability at least $1 - \delta$ that



gx.<Mog(W^l + j^. (19) 

Recall the definition $q_i = \nabla f(w_i, z_i) - \nabla F(w_i)$, and let $\sigma_i^2 = \mathbb{E}[\|q_i\|_*^2]$. Note that $\sigma_i^2 \le \sigma^2$.
We will first use this result for the sequence

$$x_i = \frac{\|q_i\|_*^2 - \sigma_i^2}{2\beta_i} + \langle q_i, w^* - w_i \rangle.$$

It is easily seen that $\mathbb{E}_{z_i}[x_i \mid z_1, \ldots, z_{i-1}] = 0$, so it is indeed a martingale difference sequence
w.r.t. $z_1, \ldots, z_m$. Moreover, $|\langle q_i, w^* - w_i \rangle| \le D \|q_i\|_* \le 2DG$ and $\|q_i\|_*^2 \le 4G^2$. In terms of
the variances, let $\mathrm{Var}_{z_i}$ and $\mathbb{E}_{z_i}$ be shorthand for the variance (resp. expectation) over $z_i$
conditioned on $z_1, \ldots, z_{i-1}$. Then

Var,^(x,) < 2Var,^ + 2Var,, - Wi)) 

- (^) + 2E.J((ft,t«' - w,)f] 

2 2 

Combining these observations with Equation (19), we get that with probability at least
$1 - \delta$,



|U.||2 _ ^2 / Ar^\ 

E + - ^ + — J log(l/5)^ 1 + 36 



bi(i7^) ■ 

(20) 

A similar type of bound can be derived for the sequence $x_i = \big(f(w_i, z_i) - f(w^*, z_i)\big) -
\big(F(w_i) - F(w^*)\big)$. It is easily verified to be a martingale difference sequence w.r.t. $z_1, \ldots, z_m$,
since

$$\mathbb{E}\big[\big(f(w_i, z_i) - f(w^*, z_i)\big) - \big(F(w_i) - F(w^*)\big) \,\big|\, z_1, \ldots, z_{i-1}\big] = 0.$$

Also,

$$\big|\big(f(w_i, z_i) - f(w^*, z_i)\big) - \big(F(w_i) - F(w^*)\big)\big| \le 4B,$$

and

$$\mathrm{Var}_{z_i}\Big(\big(f(w_i, z_i) - f(w^*, z_i)\big) - \big(F(w_i) - F(w^*)\big)\Big) = \mathrm{Var}_{z_i}\big(f(w_i, z_i) - f(w^*, z_i)\big) \le \hat\sigma^2.$$

So again using Equation (19), we have with probability at least $1 - \delta$ that



E (/(^., zi) - f{w\ Zi)) - {F{w,) - FK)) < AB log(l/5) Jl + . (21) 



i=l 



Finally, adding $F(w_1) - F(w^*)$ to both sides of Equation (18), and combining
Equation (20) and Equation (21) with a union bound, the result follows. ■



Comparing the theorem to Theorem 7, and assuming that $\beta_i = \Theta(\sqrt{i})$, we see that the
bound has additional $O(\sqrt{m})$ terms. However, the bound retains the important property
of having the dominant terms multiplied by the variances $\sigma^2, \hat\sigma^2$. Both variances become
smaller in the mini-batch setting, where the update rules are applied over averages of $b$ such
functions and their gradients. As we did earlier in the paper, let us think of this bound as
an abstract function $\psi(\sigma^2, \hat\sigma^2, \delta, m)$. Notice that now, the regret bound also depends on the
function variance $\hat\sigma^2$, and the confidence parameter $\delta$.

Theorem 13 Let $f$ be an $L$-smooth convex loss function. Assume that the stochastic
gradient $\nabla_w f(w, z_i)$ is bounded by a constant and has $\sigma^2$-bounded variance for all $i$ and all $w$,
and that $f(w, z_i)$ is bounded by a constant and has $\hat\sigma^2$-bounded variance for all $i$ and for
all $w$. If the update rule $\phi$ has a serial high-probability regret bound $\psi(\sigma^2, \hat\sigma^2, \delta, m)$, then
with probability at least $1 - \delta$, the total regret of the DMB algorithm over $m$ examples is at most

(, + (^!, ^, M + ^) + O (.^(i + 9 log(i/*)„,) . 

Comparing the obtained bound to the one in Theorem 4, we note that we pay an
additional $O(\sqrt{m})$ term.

Proof The proof closely resembles the one of Theorem 4. We let $\bar z_j$ denote the first $b$
inputs on batch $j$, and define $\bar f$ as the average loss on these inputs. Note that for any $w$,
the variance of $\bar f(w, \bar z_j)$ is at most $\hat\sigma^2 / b$, and the variance of $\nabla_w \bar f(w, \bar z_j)$ is at most $\sigma^2 / b$.
Therefore, with probability at least $1 - \delta$, it holds that

$$\sum_{j=1}^{\bar m} \big(\bar f(w_j, \bar z_j) - \bar f(w^*, \bar z_j)\big) \le \psi\Big(\frac{\sigma^2}{b}, \frac{\hat\sigma^2}{b}, \delta, \bar m\Big), \tag{22}$$

where $\bar m$ is the number of inputs given to the update rule $\phi$. Let $Z_j$ denote the set of all
examples received between the commencement of batch $j$ and the commencement of batch
$j + 1$, including the vector-sum phase in between ($b + \mu$ examples overall). In the proof of
Theorem 4 we had that



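$$\mathbb{E}\big[\bar f(w_j, \bar z_j) - \bar f(w^*, \bar z_j) \,\big|\, w_j\big] = \mathbb{E}\Big[\frac{1}{b + \mu} \sum_{z \in Z_j} \big(f(w_j, z) - f(w^*, z)\big) \,\Big|\, w_j\Big],$$

and thus the expected value of the left-hand side of Equation (22) equals the total regret,
divided by $b + \mu$. Here, we need to work a bit harder. To do so, note that the sequence of
random variables

$$\frac{1}{b} \sum_{z \in \bar z_j} \big(f(w_j, z) - f(w^*, z)\big) - \frac{1}{b + \mu} \sum_{z \in Z_j} \big(f(w_j, z) - f(w^*, z)\big),$$

indexed by $j$, is a martingale difference sequence with respect to $Z_1, Z_2, \ldots$. Moreover,
conditioned on $Z_1, \ldots, Z_{j-1}$, the variance of each such random variable is at most $4\hat\sigma^2 / b$.
To see why, note that the first sum has conditional variance at most $\hat\sigma^2 / b$, since the summands are
independent and each has variance at most $\hat\sigma^2$. Similarly, the second sum has conditional variance
at most $\hat\sigma^2 / (b + \mu) \le \hat\sigma^2 / b$. Applying the Bernstein-type inequality for martingales discussed in the
proof of Theorem 12, we get that with probability at least $1 - \delta$,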



in ^ in ^ I 



i=i 



m log(l/(5) 



where the $O$-notation hides only a (linear) dependence on the absolute bound over $|f(w, z)|$
for all $w, z$, which we assume to hold.

Combining this and Equation (22) with a union bound, we get that with probability at
least $1 - \delta$,



j=lz&Zj ^ \ 

If $b + \mu$ divides $m$, then $\bar m = m / (b + \mu)$, and we get a bound of the form



mlog(l/5) 



^2 <t2 



m 



+ 0{a^l(l + ^)log{l/6)m 



Otherwise, we repeat the ideas of Theorem 3 to get the regret bound.